
Giant 11Gb gossip stores and node crashes in v24.08.1 #7763

Open
m-schmoock opened this issue Oct 24, 2024 · 4 comments

@m-schmoock
Collaborator

Issue and Steps to Reproduce

A v24.08.1 mainnet node I have access to was creating 11GB gossip files. I didn't notice that until the node crashed when closing a channel with:

2024-10-23T19:38:53.032Z **BROKEN** gossipd: gossip_store: get delete entry offset 1399/10934507092 (version v24.08.1-modded)
2024-10-23T19:38:53.033Z **BROKEN** gossipd: backtrace: common/daemon.c:38 (send_backtrace) 0x5575ea2847
2024-10-23T19:38:53.033Z **BROKEN** gossipd: backtrace: common/status.c:221 (status_failed) 0x5575ead743
2024-10-23T19:38:53.033Z **BROKEN** gossipd: backtrace: gossipd/gossip_store.c:466 (gossip_store_get_with_hdr) 0x5575e990fb
2024-10-23T19:38:53.033Z **BROKEN** gossipd: backtrace: gossipd/gossip_store.c:592 (gossip_store_set_timestamp) 0x5575e9975b
2024-10-23T19:38:53.033Z **BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:777 (process_channel_update) 0x5575e9aeeb
2024-10-23T19:38:53.033Z **BROKEN** gossipd: backtrace: gossipd/gossmap_manage.c:901 (gossmap_manage_channel_update) 0x5575e9b8ab
2024-10-23T19:38:53.033Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:215 (handle_recv_gossip) 0x5575e97f17
2024-10-23T19:38:53.033Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:307 (connectd_req) 0x5575e98017
2024-10-23T19:38:53.033Z **BROKEN** gossipd: backtrace: common/daemon_conn.c:35 (handle_read) 0x5575ea2b6b
2024-10-23T19:38:53.034Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:60 (next_plan) 0x5575f34397
2024-10-23T19:38:53.034Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:422 (do_plan) 0x5575f3496f
2024-10-23T19:38:53.034Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:439 (io_ready) 0x5575f34a4b
2024-10-23T19:38:53.034Z **BROKEN** gossipd: backtrace: ccan/ccan/io/poll.c:455 (io_loop) 0x5575f36a0b
2024-10-23T19:38:53.034Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:672 (main) 0x5575e9831f
2024-10-23T19:38:53.034Z **BROKEN** gossipd: backtrace: ../csu/libc-start.c:308 (__libc_start_main) 0x7f8ce2fdd7
2024-10-23T19:38:53.034Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x5575e94167
2024-10-23T19:38:53.034Z **BROKEN** gossipd: STATUS_FAIL_INTERNAL_ERROR: gossip_store: get delete entry offset 1399/10934507092

After that I found out it had been creating these big gossip files. This was not the first time the node produced such large gossip stores, as a gossip_store.corrupt of the same size already existed on the node.
The node also had ridiculously long startup times, which I now believe were due to it processing these jumbo gossip stores.

If required, I can upload the 11GB store to my server so you can use it for debugging...
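The failing check above ("get delete entry offset 1399/10934507092") rejects a delete entry pointing outside the valid store. For offline triage of a bloated store, a rough scanner can count records and deleted entries. This is a sketch, not the project's own tooling (CLN ships devtools/dump-gossipstore for the real thing): the layout assumed here — a one-byte version, then records with a 12-byte big-endian header of flags, len, crc, timestamp, and a deleted bit of 0x8000 in flags — is taken from my reading of gossipd/gossip_store.h and may not match every store version.

```python
import struct

# Assumed record header layout (see gossipd/gossip_store.h):
#   beint16 flags, beint16 len, beint32 crc, beint32 timestamp
HDR = struct.Struct(">HHII")
DELETED_BIT = 0x8000  # assumed flag bit marking a deleted record


def scan_store(data: bytes):
    """Walk a gossip_store blob; return (records, deleted, live_bytes)."""
    off = 1  # skip the single version byte at the start of the file
    records = deleted = live_bytes = 0
    while off + HDR.size <= len(data):
        flags, length, _crc, _ts = HDR.unpack_from(data, off)
        if off + HDR.size + length > len(data):
            break  # truncated final record
        records += 1
        if flags & DELETED_BIT:
            deleted += 1
        else:
            live_bytes += HDR.size + length
        off += HDR.size + length
    return records, deleted, live_bytes
```

A high deleted count relative to live bytes would point at the compaction/deletion path rather than raw gossip volume as the cause of the bloat.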

@m-schmoock m-schmoock changed the title Giant gossip stores and node crashes Giant 11Gb gossip stores and node crashes in v24.08.1 Oct 26, 2024
@gudnuf
Contributor

gudnuf commented Nov 4, 2024

I also experienced a similar issue. My gossip_store was 14 GB and the node crashed. Afterwards it would take 5 to 6 minutes to restart.
Solved by deleting my gossip_store.

@erikarvstedt

erikarvstedt commented Dec 17, 2024

Same issue here with our nixbitcoin.org node.
I haven't done any detailed debugging, but here are some observations:

  • In our case, file gossip_store has size 9.8 GiB.
  • The startup delay happens in gossipd, function gossip_store_compact.
    It consumes 100% CPU in a single thread while creating a new store file gossip_store.tmp.
    In our case, this takes 27 minutes.
  • Relevant debug log output
    12:25:26 lightningd[2225309]: DEBUG   gossipd: pid 2225533, msgfd 60
    12:53:08 lightningd[2225533]: lightning_gossipd: gossmap: redundant channel_announce for 835534x1336x3, 
    offsets 18187 and 150531!
    12:53:08 lightningd[2225309]: DEBUG   gossipd: Store compact time: 1660269 msec
    12:53:08 lightningd[2225309]: DEBUG   gossipd: gossip_store: Read 22206706/208217/307050/45 cannounce/cupdate/nannounce/delete from store in 10499177800 bytes, now 10499176198 bytes (populated=false)
    
  • The clightning data dir contains a file gossip_store.corrupt (size: 31.8 MiB) which was modified 4 days before this issue appeared.
  • The compaction step reduces the gossip_store file size by 1580 bytes.
  • When I first noticed the oversized gossip store file, the clightning version in use was 24.08.2.
    The file bloat probably happened with this version. If it had been created by an earlier version, we would have noticed startup failures after updating to 24.08.2 (due to systemd service timeouts).
  • This issue has been reported to our nix-bitcoin repo before: CLN crash loop after update-nix-bitcoin and deploy fort-nix/nix-bitcoin#747

I can share the affected gossip_store for further debugging.

@erikarvstedt

Here are the historical sizes of the gossip store files from our backups. These are all the available data points.

Date        File                  Size (MiB)
2024-10-31  gossip_store                 180
2024-11-30  gossip_store                 120
2024-12-01  gossip_store                 163
2024-12-08  gossip_store                 275
2024-12-13  gossip_store                 519
2024-12-14  gossip_store.corrupt          32
2024-12-14  gossip_store                 790
2024-12-15  gossip_store.corrupt          32
2024-12-15  gossip_store                5261
2024-12-16  gossip_store.corrupt          32
2024-12-16  gossip_store                9550
2024-12-17  gossip_store.corrupt          32
2024-12-17  gossip_store               10013

erikarvstedt added a commit to fort-nix/nixbitcoin.org that referenced this issue Dec 20, 2024
@matevz

matevz commented Feb 9, 2025

My gossip_store has grown to 23 GB over the last year. I can't find in the docs what exactly the file is used for and whether it is safe to delete. I went ahead and removed it, and CLN seems to work fine. Any thoughts?
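For anyone in the same situation: the gossip_store is a cache of network gossip that gossipd rebuilds from peers, so deleting it (with the node stopped) loses no funds or channel state, though route-finding may be degraded until gossip re-syncs. A sketch of the sequence, demonstrated here against a throwaway directory since paths differ per setup:

```shell
set -eu
# On a real node, stop lightningd first (e.g. `lightning-cli stop`),
# then remove ONLY gossip_store -- never the database or hsm_secret.
LDIR=$(mktemp -d)/bitcoin
mkdir -p "$LDIR"
head -c 1048576 /dev/zero > "$LDIR/gossip_store"  # stand-in for a bloated store
rm "$LDIR/gossip_store"                           # regenerated on next startup
test ! -e "$LDIR/gossip_store" && echo removed
```

On restart, gossipd creates a fresh gossip_store and fills it from peers over the following hours.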
