Gossmap crash, issues #8053

rustyrussell · 2025-02-05T02:49:23Z

We have reports of crashes on reading gossip_store, including from gossipd itself! ``` lightning_gossipd: common/gossmap.c:121: map_copy: Assertion `offset + len <= map->map_size' failed. ... lightning_gossipd: FATAL SIGNAL (version v24.11) 0x6260c41d682a send_backtrace common/daemon.c:33 0x6260c41e098b status_failed common/status.c:221 0x6260c41e0b41 status_backtrace_exit common/subdaemon.c:18 0x6260c41d68b8 crashdump common/daemon.c:78 0x70508ea6913f ??? ???:0 0x70508e8a0d51 ??? ???:0 0x70508e88a536 ??? ???:0 0x70508e88a40e ??? ???:0 0x70508e8996d1 ??? ???:0 0x6260c41d8b69 map_copy common/gossmap.c:121 0x6260c41d8bab map_be16 common/gossmap.c:142 0x6260c41daa45 map_catchup common/gossmap.c:705 0x6260c41dab95 gossmap_refresh_mayfail common/gossmap.c:1192 0x6260c41daca6 gossmap_refresh common/gossmap.c:1213 0x6260c41cee32 gossmap_manage_get_gossmap gossipd/gossmap_manage.c:1314 0x6260c41d0686 gossmap_manage_new_block gossipd/gossmap_manage.c:1221 0x6260c41cbfdd new_blockheight gossipd/gossipd.c:473 0x6260c41cc363 recv_req gossipd/gossipd.c:584 0x6260c41d6b1d handle_read common/daemon_conn.c:35 0x6260c43175b5 next_plan ccan/ccan/io/io.c:60 0x6260c4317a40 do_plan ccan/ccan/io/io.c:422 0x6260c4317af9 io_ready ccan/ccan/io/io.c:439 0x6260c4319446 io_loop ccan/ccan/io/poll.c:455 0x6260c41cccf4 main gossipd/gossipd.c:665 ``` This implies that we have a message shorter than 2 bytes, which should never happen. An audit didn't shed any light, but let's make sure we don't ever write such a thing. Signed-off-by: Rusty Russell <[email protected]>

We have a report of this happening under ZFS. We cannot do much if this really is a problem where we can't read back what we write, but this avoids the immediate crash. Fixes: ElementsProject#7971 Signed-off-by: Rusty Russell <[email protected]> Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.

We only use it in one place, and that was simply to share an fd between gossipd writing and gossipd reading, which may be causing our zfs problem anyway. In fact, it fixes a race if we don't have HAVE_PWRITEV. Signed-off-by: Rusty Russell <[email protected]>

Default goes to stderr for LOG_UNUSUAL and higher. We have to whitelist more cases in map_catchup so we don't spam the logs with perfectly-expected (but ignored) messages though. Signed-off-by: Rusty Russell <[email protected]>

While this shouldn't happen, it does (pending other fixes), and we stop reading the gossip store until next time. The result is partial gossip, demonstrated beautifully by NicolasDorier's report: ``` lightning_gossipd: gossmap: redundant channel_announce for 864063x1306x1, offsets 1272259 and 1784859!" ``` Gossipd stalld there and don't make more progress. So gossipd itself doesn't see the entire gossip_store. Then things get really batshit: ``` 2025-02-04T05:53:28.582Z DEBUG gossipd: Store compact time: 1429910 msec ``` This took 1429 seconds to process. Why? Because it hasn't been processing the gossip store fully, gossipd kept adding "new" records to the end: ``` 2025-02-04T05:53:28.583Z DEBUG gossipd: gossip_store: Read 62716143/1739952/5158256/0 cannounce/cupdate/nannounce/delete from store in 31634458462 bytes, now 31634458440 bytes (populated=true) ``` It has 31GB of gossip in there! No wonder it took so long... Signed-off-by: Rusty Russell <[email protected]> Fixes: ElementsProject#8035 Changelog-Fixed: gossipd: corruption in the gossip_store could cause ever-longer startup times and no gossip updates.

We're about to test them in gossmap. Signed-off-by: Rusty Russell <[email protected]>

We assume if it's incorrect, we simply need to wait. If this proves incorrect, we will see a stream of BROKEN log messages. To measure the performance impact, I timed tests/test_askrene.py::test_real_biases on my laptop. Before: 194.52s After: 202.81s So it's marginal. Signed-off-by: Rusty Russell <[email protected]>

Instead of making a copy. To measure the performance impact, I timed tests/test_askrene.py::test_real_biases on my laptop. No checksum check: 194.52s Copying for checksum check: 202.81s Zero-copy checksum check: 194.40s But these numbers proved noisy. Still, doesn't hurt. Signed-off-by: Rusty Russell <[email protected]>

If they go to stderr, you can't associate them with the record they're talking about. Signed-off-by: Rusty Russell <[email protected]>

It's actually the only one that uses it. We also tweak the way gossip_store handles failure: gossmap_manage now tells it when to reset the corrupted store. Signed-off-by: Rusty Russell <[email protected]>

This means we will correctly reset the store if it has redundant records, for example. Signed-off-by: Rusty Russell <[email protected]>

That's actually caught by the gossmap load now. Signed-off-by: Rusty Russell <[email protected]>

Found by inspection. Minor bug, since we'll catch it on the next block, but annoying. Signed-off-by: Rusty Russell <[email protected]>

We had at least one report of overwriting the gossip_store file at offset 1. Make sure this doesn't happen. Signed-off-by: Rusty Russell <[email protected]>

After analyzing various weird cases where we ended up with duplicate gossip_store entries, it could be explained by us not fully processing the gossip store. It's not clear that my assumptions that we would always see our own writes are true: technically this may require an fsync(). So we now add the check, and do an fsync and try again. Signed-off-by: Rusty Russell <[email protected]> Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.

``` 2025-01-23T12:31:52.528Z DEBUG gossipd: Pruning channel 839050x1246x0 from network view (ages 1736283379 and 1737600120) 2025-01-27T00:32:01.631Z DEBUG gossipd: Pruning channel 839050x1246x0 from network view (ages 0 and 1737686520) 2025-01-27T00:50:05.998Z **BROKEN** gossipd: Dying channel 839050x1246x0 already deleted? ``` Easiest not to prune in this case. Signed-off-by: Rusty Russell <[email protected]>

endothermicdev

This looks great. I'm looking forward to trying it out. I like how our test teardowns will do more validation via the new logcb and broken messages as well. Big debugging win if the included fixes don't already resolve our issues.

ACK 023ec62

endothermicdev · 2025-02-11T18:57:14Z

plugins/libplugin.h

@@ -522,6 +522,12 @@ struct plugin_timer *command_timer_(struct command *cmd,
 void plugin_log(struct plugin *p, enum log_level l, const char *fmt, ...) PRINTF_FMT(3, 4);
 void plugin_logv(struct plugin *p, enum log_level l, const char *fmt, va_list ap);

+/* plogin_logv wrapper for gossmap */


Suggested change

/* plogin_logv wrapper for gossmap */

/* plugin_logv wrapper for gossmap */

endothermicdev · 2025-02-11T20:17:29Z

gossipd/gossip_store.h

- * @gs: the gossip_store.
- *
- * Used by gossmap_manage to create a gossmap.
+ * Move the old gossip store out the way.  Log a broken message about it.


Suggested change

* Move the old gossip store out the way. Log a broken message about it.

* Move the old gossip store out of the way. Log a broken message about it.

endothermicdev · 2025-02-11T20:34:09Z

common/gossmap.h

+		      (cbarg))
+
+/* If we're the author of the gossmap, it should have no redundant records, corruption, etc.
+ * So this fails if that's not the case. */


Making Jon Postel proud! This is cool solution to what we were missing when gossmap took over the gossip_store.

rustyrussell added this to the v25.02 milestone Feb 5, 2025

rustyrussell force-pushed the guilt/gossmap-crash2 branch 3 times, most recently from ebf78bc to d32b9e5 Compare February 5, 2025 23:33

rustyrussell force-pushed the guilt/gossmap-crash2 branch from d32b9e5 to 15f5e4c Compare February 6, 2025 02:44

rustyrussell added 7 commits February 9, 2025 12:02

gossmap: add (and use) logging hook.

a6d66c8

Default goes to stderr for LOG_UNUSUAL and higher. We have to whitelist more cases in map_catchup so we don't spam the logs with perfectly-expected (but ignored) messages though. Signed-off-by: Rusty Russell <[email protected]>

tests: put proper checksums into test gossip_store files.

fd03690

We're about to test them in gossmap. Signed-off-by: Rusty Russell <[email protected]>

devtools: have dump-gossipstore print malformed warnings to stdout.

3b5b75f

If they go to stderr, you can't associate them with the record they're talking about. Signed-off-by: Rusty Russell <[email protected]>

rustyrussell force-pushed the guilt/gossmap-crash2 branch from 15f5e4c to 7ad3192 Compare February 10, 2025 02:11

rustyrussell added 2 commits February 10, 2025 14:55

gossipd: put gossip_store pointer inside gossmap_manage.

dd608fd

It's actually the only one that uses it. We also tweak the way gossip_store handles failure: gossmap_manage now tells it when to reset the corrupted store. Signed-off-by: Rusty Russell <[email protected]>

gossmap: stricter checks when gossipd itself loads the gossip_store.

8e12f79

This means we will correctly reset the store if it has redundant records, for example. Signed-off-by: Rusty Russell <[email protected]>

rustyrussell force-pushed the guilt/gossmap-crash2 branch from 7ad3192 to cec4174 Compare February 10, 2025 04:41

rustyrussell added 5 commits February 10, 2025 16:37

gossip_store: don't need to check for truncated amounts.

b1ab136

That's actually caught by the gossmap load now. Signed-off-by: Rusty Russell <[email protected]>

gossipd: correctly process dying channels.

fc0b33c

Found by inspection. Minor bug, since we'll catch it on the next block, but annoying. Signed-off-by: Rusty Russell <[email protected]>

gossipd: check that we are always appending.

9cc107a

We had at least one report of overwriting the gossip_store file at offset 1. Make sure this doesn't happen. Signed-off-by: Rusty Russell <[email protected]>

rustyrussell force-pushed the guilt/gossmap-crash2 branch from cec4174 to 023ec62 Compare February 10, 2025 06:08

endothermicdev approved these changes Feb 11, 2025

View reviewed changes

endothermicdev merged commit d554206 into ElementsProject:master Feb 11, 2025
40 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Gossmap crash, issues #8053

Gossmap crash, issues #8053

Uh oh!

rustyrussell commented Feb 5, 2025

Uh oh!

endothermicdev left a comment

Uh oh!

endothermicdev Feb 11, 2025

Uh oh!

endothermicdev Feb 11, 2025

Uh oh!

endothermicdev Feb 11, 2025

Uh oh!

Uh oh!

Uh oh!

	/* plogin_logv wrapper for gossmap */
	/* plugin_logv wrapper for gossmap */

	* Move the old gossip store out the way. Log a broken message about it.
	* Move the old gossip store out of the way. Log a broken message about it.

Gossmap crash, issues #8053

Gossmap crash, issues #8053

Uh oh!

Conversation

rustyrussell commented Feb 5, 2025

Uh oh!

endothermicdev left a comment

Choose a reason for hiding this comment

Uh oh!

endothermicdev Feb 11, 2025

Choose a reason for hiding this comment

Uh oh!

endothermicdev Feb 11, 2025

Choose a reason for hiding this comment

Uh oh!

endothermicdev Feb 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!