Skip to content

Gossmap crash, issues #8053

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

rustyrussell
Copy link
Contributor

Fixes #7971

We have reports of crashes on reading gossip_store, including from gossipd itself!

```
lightning_gossipd: common/gossmap.c:121: map_copy: Assertion `offset + len <= map->map_size' failed.
...
lightning_gossipd: FATAL SIGNAL (version v24.11)
0x6260c41d682a send_backtrace
  common/daemon.c:33
0x6260c41e098b status_failed
  common/status.c:221
0x6260c41e0b41 status_backtrace_exit
  common/subdaemon.c:18
0x6260c41d68b8 crashdump
  common/daemon.c:78
0x70508ea6913f ???
  ???:0
0x70508e8a0d51 ???
  ???:0
0x70508e88a536 ???
  ???:0
0x70508e88a40e ???
  ???:0
0x70508e8996d1 ???
  ???:0
0x6260c41d8b69 map_copy
  common/gossmap.c:121
0x6260c41d8bab map_be16
  common/gossmap.c:142
0x6260c41daa45 map_catchup
  common/gossmap.c:705
0x6260c41dab95 gossmap_refresh_mayfail
  common/gossmap.c:1192
0x6260c41daca6 gossmap_refresh
  common/gossmap.c:1213
0x6260c41cee32 gossmap_manage_get_gossmap
  gossipd/gossmap_manage.c:1314
0x6260c41d0686 gossmap_manage_new_block
  gossipd/gossmap_manage.c:1221
0x6260c41cbfdd new_blockheight
  gossipd/gossipd.c:473
0x6260c41cc363 recv_req
  gossipd/gossipd.c:584
0x6260c41d6b1d handle_read
  common/daemon_conn.c:35
0x6260c43175b5 next_plan
  ccan/ccan/io/io.c:60
0x6260c4317a40 do_plan
  ccan/ccan/io/io.c:422
0x6260c4317af9 io_ready
  ccan/ccan/io/io.c:439
0x6260c4319446 io_loop
  ccan/ccan/io/poll.c:455
0x6260c41cccf4 main
  gossipd/gossipd.c:665
```

This implies that we have a message shorter than 2 bytes, which should never happen.

An audit didn't shed any light, but let's make sure we don't ever write such a thing.

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell added this to the v25.02 milestone Feb 5, 2025
@rustyrussell rustyrussell force-pushed the guilt/gossmap-crash2 branch 3 times, most recently from ebf78bc to d32b9e5 Compare February 5, 2025 23:33
We have a report of this happening under ZFS.  We cannot do much if
this really is a problem where we can't read back what we write, but
this avoids the immediate crash.

Fixes: ElementsProject#7971
Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.
We only use it in one place, and that was simply to share an fd between
gossipd writing and gossipd reading, which may be causing our zfs problem
anyway.

In fact, it fixes a race if we don't have HAVE_PWRITEV.

Signed-off-by: Rusty Russell <[email protected]>
Default goes to stderr for LOG_UNUSUAL and higher.

We have to whitelist more cases in map_catchup so we don't spam the logs
with perfectly-expected (but ignored) messages though.

Signed-off-by: Rusty Russell <[email protected]>
While this shouldn't happen, it does (pending other fixes), and we stop reading the
gossip store until next time.  The result is partial gossip, demonstrated beautifully
by NicolasDorier's report:

```
lightning_gossipd: gossmap: redundant channel_announce for 864063x1306x1, offsets 1272259 and 1784859!"
```

Gossipd stalld there and don't make more progress.  So gossipd itself
doesn't see the entire gossip_store.

Then things get really batshit:

```
2025-02-04T05:53:28.582Z DEBUG   gossipd: Store compact time: 1429910 msec
```

This took 1429 seconds to process.  Why?

Because it hasn't been processing the gossip store fully, gossipd kept adding "new" records to the end:

```
2025-02-04T05:53:28.583Z DEBUG   gossipd: gossip_store: Read 62716143/1739952/5158256/0 cannounce/cupdate/nannounce/delete from store in 31634458462 bytes, now 31634458440 bytes (populated=true)
```

It has 31GB of gossip in there!  No wonder it took so long...

Signed-off-by: Rusty Russell <[email protected]>
Fixes: ElementsProject#8035
Changelog-Fixed: gossipd: corruption in the gossip_store could cause ever-longer startup times and no gossip updates.
We're about to test them in gossmap.

Signed-off-by: Rusty Russell <[email protected]>
We assume if it's incorrect, we simply need to wait.  If this proves incorrect,
we will see a stream of BROKEN log messages.

To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.

	Before: 194.52s
	After: 202.81s

So it's marginal.

Signed-off-by: Rusty Russell <[email protected]>
Instead of making a copy.

To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.

	No checksum check: 194.52s
	Copying for checksum check: 202.81s
	Zero-copy checksum check: 194.40s

But these numbers proved noisy.  Still, doesn't hurt.

Signed-off-by: Rusty Russell <[email protected]>
If they go to stderr, you can't associate them with the record they're
talking about.

Signed-off-by: Rusty Russell <[email protected]>
It's actually the only one that uses it.  We also tweak the way
gossip_store handles failure: gossmap_manage now tells it when to
reset the corrupted store.

Signed-off-by: Rusty Russell <[email protected]>
This means we will correctly reset the store if it has redundant
records, for example.

Signed-off-by: Rusty Russell <[email protected]>
That's actually caught by the gossmap load now.

Signed-off-by: Rusty Russell <[email protected]>
Found by inspection.  Minor bug, since we'll catch it on the next block,
but annoying.

Signed-off-by: Rusty Russell <[email protected]>
We had at least one report of overwriting the gossip_store file at
offset 1.  Make sure this doesn't happen.

Signed-off-by: Rusty Russell <[email protected]>
After analyzing various weird cases where we ended up with duplicate
gossip_store entries, it could be explained by us not fully processing
the gossip store.

It's not clear that my assumptions that we would always see our own writes
are true: technically this may require an fsync().  So we now add the
check, and do an fsync and try again.

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.
```
2025-01-23T12:31:52.528Z DEBUG   gossipd: Pruning channel 839050x1246x0 from network view (ages 1736283379 and 1737600120)
2025-01-27T00:32:01.631Z DEBUG   gossipd: Pruning channel 839050x1246x0 from network view (ages 0 and 1737686520)
2025-01-27T00:50:05.998Z **BROKEN** gossipd: Dying channel 839050x1246x0 already deleted?
```

Easiest not to prune in this case.

Signed-off-by: Rusty Russell <[email protected]>
Copy link
Collaborator

@endothermicdev endothermicdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I'm looking forward to trying it out. I like how our test teardowns will do more validation via the new logcb and broken messages as well. Big debugging win if the included fixes don't already resolve our issues.

ACK 023ec62

@@ -522,6 +522,12 @@ struct plugin_timer *command_timer_(struct command *cmd,
void plugin_log(struct plugin *p, enum log_level l, const char *fmt, ...) PRINTF_FMT(3, 4);
void plugin_logv(struct plugin *p, enum log_level l, const char *fmt, va_list ap);

/* plogin_logv wrapper for gossmap */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* plogin_logv wrapper for gossmap */
/* plugin_logv wrapper for gossmap */

* @gs: the gossip_store.
*
* Used by gossmap_manage to create a gossmap.
* Move the old gossip store out the way. Log a broken message about it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Move the old gossip store out the way. Log a broken message about it.
* Move the old gossip store out of the way. Log a broken message about it.

(cbarg))

/* If we're the author of the gossmap, it should have no redundant records, corruption, etc.
* So this fails if that's not the case. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making Jon Postel proud! This is cool solution to what we were missing when gossmap took over the gossip_store.

@endothermicdev endothermicdev merged commit d554206 into ElementsProject:master Feb 11, 2025
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CLN crash and gossip store corruption
2 participants