-
Notifications
You must be signed in to change notification settings - Fork 945
Gossmap crash, issues #8053
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gossmap crash, issues #8053
Conversation
We have reports of crashes on reading gossip_store, including from gossipd itself! ``` lightning_gossipd: common/gossmap.c:121: map_copy: Assertion `offset + len <= map->map_size' failed. ... lightning_gossipd: FATAL SIGNAL (version v24.11) 0x6260c41d682a send_backtrace common/daemon.c:33 0x6260c41e098b status_failed common/status.c:221 0x6260c41e0b41 status_backtrace_exit common/subdaemon.c:18 0x6260c41d68b8 crashdump common/daemon.c:78 0x70508ea6913f ??? ???:0 0x70508e8a0d51 ??? ???:0 0x70508e88a536 ??? ???:0 0x70508e88a40e ??? ???:0 0x70508e8996d1 ??? ???:0 0x6260c41d8b69 map_copy common/gossmap.c:121 0x6260c41d8bab map_be16 common/gossmap.c:142 0x6260c41daa45 map_catchup common/gossmap.c:705 0x6260c41dab95 gossmap_refresh_mayfail common/gossmap.c:1192 0x6260c41daca6 gossmap_refresh common/gossmap.c:1213 0x6260c41cee32 gossmap_manage_get_gossmap gossipd/gossmap_manage.c:1314 0x6260c41d0686 gossmap_manage_new_block gossipd/gossmap_manage.c:1221 0x6260c41cbfdd new_blockheight gossipd/gossipd.c:473 0x6260c41cc363 recv_req gossipd/gossipd.c:584 0x6260c41d6b1d handle_read common/daemon_conn.c:35 0x6260c43175b5 next_plan ccan/ccan/io/io.c:60 0x6260c4317a40 do_plan ccan/ccan/io/io.c:422 0x6260c4317af9 io_ready ccan/ccan/io/io.c:439 0x6260c4319446 io_loop ccan/ccan/io/poll.c:455 0x6260c41cccf4 main gossipd/gossipd.c:665 ``` This implies that we have a message shorter than 2 bytes, which should never happen. An audit didn't shed any light, but let's make sure we don't ever write such a thing. Signed-off-by: Rusty Russell <[email protected]>
ebf78bc
to
d32b9e5
Compare
We have a report of this happening under ZFS. We cannot do much if this really is a problem where we can't read back what we write, but this avoids the immediate crash. Fixes: ElementsProject#7971 Signed-off-by: Rusty Russell <[email protected]> Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.
d32b9e5
to
15f5e4c
Compare
We only use it in one place, and that was simply to share an fd between gossipd writing and gossipd reading, which may be causing our zfs problem anyway. In fact, it fixes a race if we don't have HAVE_PWRITEV. Signed-off-by: Rusty Russell <[email protected]>
Default goes to stderr for LOG_UNUSUAL and higher. We have to whitelist more cases in map_catchup so we don't spam the logs with perfectly-expected (but ignored) messages though. Signed-off-by: Rusty Russell <[email protected]>
While this shouldn't happen, it does (pending other fixes), and we stop reading the gossip store until next time. The result is partial gossip, demonstrated beautifully by NicolasDorier's report: ``` lightning_gossipd: gossmap: redundant channel_announce for 864063x1306x1, offsets 1272259 and 1784859!" ``` Gossipd stalld there and don't make more progress. So gossipd itself doesn't see the entire gossip_store. Then things get really batshit: ``` 2025-02-04T05:53:28.582Z DEBUG gossipd: Store compact time: 1429910 msec ``` This took 1429 seconds to process. Why? Because it hasn't been processing the gossip store fully, gossipd kept adding "new" records to the end: ``` 2025-02-04T05:53:28.583Z DEBUG gossipd: gossip_store: Read 62716143/1739952/5158256/0 cannounce/cupdate/nannounce/delete from store in 31634458462 bytes, now 31634458440 bytes (populated=true) ``` It has 31GB of gossip in there! No wonder it took so long... Signed-off-by: Rusty Russell <[email protected]> Fixes: ElementsProject#8035 Changelog-Fixed: gossipd: corruption in the gossip_store could cause ever-longer startup times and no gossip updates.
We're about to test them in gossmap. Signed-off-by: Rusty Russell <[email protected]>
We assume if it's incorrect, we simply need to wait. If this proves incorrect, we will see a stream of BROKEN log messages. To measure the performance impact, I timed tests/test_askrene.py::test_real_biases on my laptop. Before: 194.52s After: 202.81s So it's marginal. Signed-off-by: Rusty Russell <[email protected]>
Instead of making a copy. To measure the performance impact, I timed tests/test_askrene.py::test_real_biases on my laptop. No checksum check: 194.52s Copying for checksum check: 202.81s Zero-copy checksum check: 194.40s But these numbers proved noisy. Still, doesn't hurt. Signed-off-by: Rusty Russell <[email protected]>
If they go to stderr, you can't associate them with the record they're talking about. Signed-off-by: Rusty Russell <[email protected]>
15f5e4c
to
7ad3192
Compare
It's actually the only one that uses it. We also tweak the way gossip_store handles failure: gossmap_manage now tells it when to reset the corrupted store. Signed-off-by: Rusty Russell <[email protected]>
This means we will correctly reset the store if it has redundant records, for example. Signed-off-by: Rusty Russell <[email protected]>
7ad3192
to
cec4174
Compare
That's actually caught by the gossmap load now. Signed-off-by: Rusty Russell <[email protected]>
Found by inspection. Minor bug, since we'll catch it on the next block, but annoying. Signed-off-by: Rusty Russell <[email protected]>
We had at least one report of overwriting the gossip_store file at offset 1. Make sure this doesn't happen. Signed-off-by: Rusty Russell <[email protected]>
After analyzing various weird cases where we ended up with duplicate gossip_store entries, it could be explained by us not fully processing the gossip store. It's not clear that my assumptions that we would always see our own writes are true: technically this may require an fsync(). So we now add the check, and do an fsync and try again. Signed-off-by: Rusty Russell <[email protected]> Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.
``` 2025-01-23T12:31:52.528Z DEBUG gossipd: Pruning channel 839050x1246x0 from network view (ages 1736283379 and 1737600120) 2025-01-27T00:32:01.631Z DEBUG gossipd: Pruning channel 839050x1246x0 from network view (ages 0 and 1737686520) 2025-01-27T00:50:05.998Z **BROKEN** gossipd: Dying channel 839050x1246x0 already deleted? ``` Easiest not to prune in this case. Signed-off-by: Rusty Russell <[email protected]>
cec4174
to
023ec62
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. I'm looking forward to trying it out. I like how our test teardowns will do more validation via the new logcb and broken messages as well. Big debugging win if the included fixes don't already resolve our issues.
ACK 023ec62
@@ -522,6 +522,12 @@ struct plugin_timer *command_timer_(struct command *cmd, | |||
void plugin_log(struct plugin *p, enum log_level l, const char *fmt, ...) PRINTF_FMT(3, 4); | |||
void plugin_logv(struct plugin *p, enum log_level l, const char *fmt, va_list ap); | |||
|
|||
/* plogin_logv wrapper for gossmap */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* plogin_logv wrapper for gossmap */ | |
/* plugin_logv wrapper for gossmap */ |
* @gs: the gossip_store. | ||
* | ||
* Used by gossmap_manage to create a gossmap. | ||
* Move the old gossip store out the way. Log a broken message about it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Move the old gossip store out the way. Log a broken message about it. | |
* Move the old gossip store out of the way. Log a broken message about it. |
(cbarg)) | ||
|
||
/* If we're the author of the gossmap, it should have no redundant records, corruption, etc. | ||
* So this fails if that's not the case. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making Jon Postel proud! This is cool solution to what we were missing when gossmap took over the gossip_store.
Fixes #7971