Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gossmap crash, issues #8053

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

rustyrussell
Copy link
Contributor

Fixes #7971

We have reports of crashes on reading gossip_store, including from gossipd itself!

```
lightning_gossipd: common/gossmap.c:121: map_copy: Assertion `offset + len <= map->map_size' failed.
...
lightning_gossipd: FATAL SIGNAL (version v24.11)
0x6260c41d682a send_backtrace
  common/daemon.c:33
0x6260c41e098b status_failed
  common/status.c:221
0x6260c41e0b41 status_backtrace_exit
  common/subdaemon.c:18
0x6260c41d68b8 crashdump
  common/daemon.c:78
0x70508ea6913f ???
  ???:0
0x70508e8a0d51 ???
  ???:0
0x70508e88a536 ???
  ???:0
0x70508e88a40e ???
  ???:0
0x70508e8996d1 ???
  ???:0
0x6260c41d8b69 map_copy
  common/gossmap.c:121
0x6260c41d8bab map_be16
  common/gossmap.c:142
0x6260c41daa45 map_catchup
  common/gossmap.c:705
0x6260c41dab95 gossmap_refresh_mayfail
  common/gossmap.c:1192
0x6260c41daca6 gossmap_refresh
  common/gossmap.c:1213
0x6260c41cee32 gossmap_manage_get_gossmap
  gossipd/gossmap_manage.c:1314
0x6260c41d0686 gossmap_manage_new_block
  gossipd/gossmap_manage.c:1221
0x6260c41cbfdd new_blockheight
  gossipd/gossipd.c:473
0x6260c41cc363 recv_req
  gossipd/gossipd.c:584
0x6260c41d6b1d handle_read
  common/daemon_conn.c:35
0x6260c43175b5 next_plan
  ccan/ccan/io/io.c:60
0x6260c4317a40 do_plan
  ccan/ccan/io/io.c:422
0x6260c4317af9 io_ready
  ccan/ccan/io/io.c:439
0x6260c4319446 io_loop
  ccan/ccan/io/poll.c:455
0x6260c41cccf4 main
  gossipd/gossipd.c:665
```

This implies that we have a message shorter than 2 bytes, which should never happen.

An audit didn't shed any light, but let's make sure we don't ever write such a thing.

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell added this to the v25.02 milestone Feb 5, 2025
@rustyrussell rustyrussell force-pushed the guilt/gossmap-crash2 branch 3 times, most recently from ebf78bc to d32b9e5 Compare February 5, 2025 23:33
We have a report of this happening under ZFS.  We cannot do much if
this really is a problem where we can't read back what we write, but
this avoids the immediate crash.

Fixes: ElementsProject#7971
Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.
We only use it in one place, and that was simply to share an fd between
gossipd writing and gossipd reading, which may be causing our zfs problem
anyway.

In fact, it fixes a race if we don't have HAVE_PWRITEV.

Signed-off-by: Rusty Russell <[email protected]>
Default goes to stderr for LOG_UNUSUAL and higher.

Signed-off-by: Rusty Russell <[email protected]>
We're about to test them in gossmap.

Signed-off-by: Rusty Russell <[email protected]>
We assume if it's incorrect, we simply need to wait.  If this proves incorrect,
we will see a stream of BROKEN log messages.

To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.

	Before: 194.52s
	After: 202.81s

So it's marginal.

Signed-off-by: Rusty Russell <[email protected]>
Instead of making a copy.

To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.

	No checksum check: 194.52s
	Copying for checksum check: 202.81s
	Zero-copy checksum check: 194.40s

But these numbers proved noisy.  Still, doesn't hurt.

Signed-off-by: Rusty Russell <[email protected]>
If they go to stderr, you can't associate them with the record they're
talking about.

Signed-off-by: Rusty Russell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CLN crash and gossip store corruption
1 participant