
graph/db: add zombie channels cleanup routine #10015


Open

GustavoStingelin wants to merge 3 commits into master from graph-cache/zombie-channels

Conversation

GustavoStingelin
Contributor

@GustavoStingelin GustavoStingelin commented Jul 1, 2025

This PR addresses issue #9524: zombie channels remained in the in-memory graph cache, leading to incorrect pathfinding behavior and unnecessary memory consumption.

Benchmark

To evaluate the performance impact of the cleanup logic, I added a benchmark simulating a node graph with 50,000 nodes and 500,000 channels. On my machine, the cleanup took approximately 120 ms, which I think is acceptable for a daily cleanup routine. Additionally, we could potentially improve this by using the channelCache struct, but it appears underutilized.

$ go test -bench=. ./graph/db
goos: linux
goarch: amd64
pkg: github.com/lightningnetwork/lnd/graph/db
cpu: AMD Ryzen 7 5700X 8-Core Processor             
BenchmarkGraphCacheCleanupZombies-16    	       9	 120417156 ns/op	       120.3 ms/op	16957585 B/op	  149297 allocs/op
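
For context, a standalone benchmark in the same spirit could look like the sketch below. It does not use the real GraphCache API; the toy maps, graph size, and zombie ratio are assumptions chosen only to show the shape of such a benchmark (it would live in a _test.go file).

package graphbench

import (
	"sync"
	"testing"
)

// toyCache models just enough of a graph cache to time a zombie sweep:
// a node-to-channel-set map plus the zombie index.
type toyCache struct {
	mtx          sync.RWMutex
	nodeChannels map[uint32]map[uint64]struct{}
	zombieIndex  map[uint64]struct{}
}

// cleanupZombies removes every channel marked as a zombie from every
// node's channel set, then resets the index.
func (c *toyCache) cleanupZombies() {
	c.mtx.Lock()
	defer c.mtx.Unlock()

	for _, channels := range c.nodeChannels {
		for cid := range channels {
			if _, ok := c.zombieIndex[cid]; ok {
				delete(channels, cid)
			}
		}
	}
	c.zombieIndex = make(map[uint64]struct{})
}

func BenchmarkCleanupZombies(b *testing.B) {
	const (
		numNodes     = 50_000
		chansPerNode = 10 // Roughly 500,000 channels in total.
	)

	for i := 0; i < b.N; i++ {
		// Building the graph is setup, not the code under test.
		b.StopTimer()
		c := &toyCache{
			nodeChannels: make(map[uint32]map[uint64]struct{}, numNodes),
			zombieIndex:  make(map[uint64]struct{}),
		}
		cid := uint64(0)
		for n := uint32(0); n < numNodes; n++ {
			chans := make(map[uint64]struct{}, chansPerNode)
			for j := 0; j < chansPerNode; j++ {
				chans[cid] = struct{}{}
				if cid%100 == 0 {
					c.zombieIndex[cid] = struct{}{}
				}
				cid++
			}
			c.nodeChannels[n] = chans
		}
		b.StartTimer()

		c.cleanupZombies()
	}
}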


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @GustavoStingelin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential memory leak in the GraphCache by implementing a "zombie channel" cleanup mechanism. It introduces a dedicated index to track channels that cannot be immediately removed due to incomplete node information and a background process to periodically prune these channels, ensuring the cache remains consistent and efficient.

Highlights

  • Zombie Channel Tracking: Introduced a zombieIndex within the GraphCache to temporarily store channel IDs that cannot be fully removed immediately due to missing node information (e.g., a zeroVertex).
  • Background Cleanup Process: Added a new background goroutine (zombieCleaner) that periodically (every 24 hours) attempts to clean up channels listed in the zombieIndex, ensuring they are eventually removed from the cache.
  • Robust Channel Removal: Modified the RemoveChannel method to identify channels where one or both associated nodes are unknown (represented by a zeroVertex) and add them to the zombieIndex for deferred cleanup, preventing potential memory leaks (see the sketch after this list).
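
For readers skimming the diff, the deferred-removal idea from the highlights boils down to roughly the following standalone sketch. The vertex type, field names, and removeChannel signature here are simplifications and assumptions, not the PR's exact code.

package graphcache

import "sync"

// vertex stands in for route.Vertex; the zero value means the node is
// unknown.
type vertex [33]byte

var zeroVertex vertex

// cache is a stripped-down stand-in for GraphCache that keeps only the
// maps needed to illustrate the zombie-index bookkeeping.
type cache struct {
	mtx sync.RWMutex

	// nodeChannels maps a node to the set of channels it participates in.
	nodeChannels map[vertex]map[uint64]struct{}

	// zombieIndex holds channel IDs that could not be removed right away
	// because one or both endpoints were unknown.
	zombieIndex map[uint64]struct{}
}

// removeChannel deletes the channel from every known endpoint. If either
// endpoint is unknown (zero), the channel ID is parked in the zombie
// index so the background cleaner can drop the remaining references later.
func (c *cache) removeChannel(node1, node2 vertex, chanID uint64) {
	c.mtx.Lock()
	defer c.mtx.Unlock()

	if node1 != zeroVertex {
		delete(c.nodeChannels[node1], chanID)
	}
	if node2 != zeroVertex {
		delete(c.nodeChannels[node2], chanID)
	}

	if node1 == zeroVertex || node2 == zeroVertex {
		c.zombieIndex[chanID] = struct{}{}
	}
}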


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a background process to clean up zombie channels from the graph cache, addressing a potential memory leak. The implementation includes a dedicated goroutine and a zombie index to track channels awaiting removal. The review suggests improvements for maintainability and performance, such as making the cleanup interval configurable and optimizing the cleanup logic.

@ellemouton
Collaborator

thanks for the PR @GustavoStingelin!
Feel free to ping me once this is ready for review. Also remember to remove the [skip ci] from the commit message at that point so that the CI can run

@GustavoStingelin GustavoStingelin force-pushed the graph-cache/zombie-channels branch 2 times, most recently from 8acd2d2 to 0142868 on July 1, 2025 22:00
@GustavoStingelin GustavoStingelin changed the title from "DRAFT: graph/db: add zombie channel process - WIP [skip ci]" to "graph/db: add zombie channels cleanup routine" on Jul 1, 2025
@GustavoStingelin GustavoStingelin marked this pull request as ready for review July 2, 2025 02:08
@GustavoStingelin
Contributor Author

@ellemouton ready!

@ellemouton ellemouton self-requested a review July 2, 2025 07:08
Contributor

@MPins MPins left a comment


Well done! 👏

I ran the tests and everything LGTM ✅

Here are the benchmark results on my machine:

goos: linux
goarch: amd64
pkg: github.com/lightningnetwork/lnd/graph/db
cpu: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
=== RUN BenchmarkGraphCacheCleanupZombies
BenchmarkGraphCacheCleanupZombies
BenchmarkGraphCacheCleanupZombies-8 5 207848522 ns/op 207.8 ms/op 31292878 B/op 245935 allocs/op
PASS
ok github.com/lightningnetwork/lnd/graph/db 9.526s

Collaborator

@ellemouton ellemouton left a comment


Looking great so far! Thanks for this 🙏

@@ -83,6 +94,9 @@ func NewGraphCache(preAllocNumNodes int) *GraphCache {
map[route.Vertex]*lnwire.FeatureVector,
preAllocNumNodes,
),
zombieIndex: make(map[uint64]struct{}),
zombieCleanerInterval: 24 * time.Hour,
Collaborator


probably worth making this configurable. But that can be done in a follow up 👍

Contributor Author


should I open the follow up now or after the merge?

Member


I'd say after the merge - also curious about the value chosen here? why 24 hours instead of, say 1 hour?

Contributor Author


I used the interval suggested by Elle in the issue, but I believe a shorter interval might help keep the data more up to date. I personally prefer 1 hour, though this is purely empirical.

Contributor Author


changed to one hour

@@ -290,7 +385,13 @@ func (c *GraphCache) getChannels(node route.Vertex) []*DirectedChannel {

i := 0
channelsCopy := make([]*DirectedChannel, len(channels))
for _, channel := range channels {
for cid, channel := range channels {
if _, ok := c.zombieIndex[cid]; ok {
Collaborator


any reason not to delete the channel at this point?

Contributor Author


It's because at this point we're holding the mutex in read mode, so multiple reads can happen at once. If we want to delete the channel, we'd need to switch to a write lock, which might slow things down a bit. Another idea is to send the deletion request to a channel and have a separate goroutine handle it when it can grab the write lock.

have you thought of any other ways we could handle this?
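
For illustration, the "hand the deletion off to another goroutine" idea mentioned above could be shaped like the sketch below. The buffered reqs channel, its size, and the deferredDeleter type are assumptions made to keep the example self-contained, not code from this PR.

package graphcache

import "sync"

// deferredDeleter illustrates handing deletions observed on the read path
// to a single worker that owns the write lock.
type deferredDeleter struct {
	mtx   sync.RWMutex
	chans map[uint64]struct{}

	// reqs is buffered so the read path never blocks on the write lock.
	reqs chan uint64
}

func newDeferredDeleter() *deferredDeleter {
	d := &deferredDeleter{
		chans: make(map[uint64]struct{}),
		reqs:  make(chan uint64, 128),
	}

	// A single worker applies the queued deletions under the write lock.
	go func() {
		for cid := range d.reqs {
			d.mtx.Lock()
			delete(d.chans, cid)
			d.mtx.Unlock()
		}
	}()

	return d
}

// markZombie is called from the read path; it only enqueues the ID and
// never waits for the write lock itself.
func (d *deferredDeleter) markZombie(cid uint64) {
	select {
	case d.reqs <- cid:
	default:
		// Queue full: leave the ID for the periodic cleaner instead.
	}
}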

@GustavoStingelin GustavoStingelin force-pushed the graph-cache/zombie-channels branch 3 times, most recently from 6a1bd16 to 46d2623 on July 10, 2025 16:39
Member

@yyforyongyu yyforyongyu left a comment


Thanks for the PR! My main question is: does it cost more if we just remove it directly inside RemoveChannel? And if the zombies are only cleaned every X hours, does that mean pathfinding may fail due to the zombies?


ticker := time.NewTicker(c.zombieCleanerInterval)
defer func() {
ticker.Stop()
c.wg.Done()
Member


This line can be moved to the top

Contributor Author


like this?

func (c *GraphCache) zombieCleaner() {
	defer c.wg.Done()
	ticker := time.NewTicker(c.zombieCleanerInterval)
	defer ticker.Stop()

Member


yeah, and a new line after defer c.wg.Done().
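
Putting the two comments together, the whole goroutine could be shaped like the sketch below. The quit channel and the cleanupZombieChannels method are assumptions used to keep the example self-contained; the point is only the defer ordering and the ticker loop.

package graphcache

import (
	"sync"
	"time"
)

type cache struct {
	wg                    sync.WaitGroup
	quit                  chan struct{}
	zombieCleanerInterval time.Duration
}

// cleanupZombieChannels is a stand-in for whatever method actually sweeps
// the zombie index under the write lock.
func (c *cache) cleanupZombieChannels() {}

// zombieCleaner periodically sweeps the zombie index until the cache is
// shut down via the quit channel.
func (c *cache) zombieCleaner() {
	defer c.wg.Done()

	ticker := time.NewTicker(c.zombieCleanerInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			c.cleanupZombieChannels()

		case <-c.quit:
			return
		}
	}
}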

// Otherwise we would leak the channel in the memory cache, since we
// don't have the node ID to remove it by, so we add it to the zombie
// index for later removal.
c.zombieIndex[chanID] = struct{}{}
Member


since we know it's a zombie why can't we just remove it here?

Contributor Author


related to

does it cost more if we just remove it directly inside RemoveChannel?

the reason is that we only have the channel ID, not the node ID. To find the node ID, we need to traverse the nodeChannels map, which is more expensive. If there is a way to access the node ID directly, we might be able to avoid this extra zombieCleaner.
I am not entirely sure though. @ellemouton, could you help clarify this?

Member


we need to traverse the nodeChannels map, which is more expensive.

How much more expensive? I think it's about tradeoffs here - like how often we call RemoveChannel, and how large the nodeChannels map can be? And if we don't remove it in real-time, what's the implication on pathfinding?

Contributor Author


The cost is looping over all nodes and retrieving their channels. The graph has roughly 16k nodes, and the sweep holds the write lock on the map for 50-150 ms, depending on the node's hardware (an estimate extrapolated from the benchmark with 50k nodes).

Without this PR, zombie channels could be returned during pathfinding, leading to incorrect route attempts. This PR introduces a low-cost check during node-channel retrieval from the cache (used in graph traversal) that filters zombie channels out of the returned slice.

As a result, pathfinding is no longer affected. The zombieCleaner mainly serves to free memory, without needing to lock the map for every read.
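
In sketch form, the read-path filtering described above amounts to something like the following. This is a standalone simplification with assumed types, not the PR's exact getChannels.

package graphcache

import "sync"

// directedChannel stands in for the real DirectedChannel; policy and
// capacity fields are elided.
type directedChannel struct {
	ChannelID uint64
}

type cache struct {
	mtx          sync.RWMutex
	nodeChannels map[[33]byte]map[uint64]*directedChannel
	zombieIndex  map[uint64]struct{}
}

// getChannels copies a node's channels while holding only the read lock,
// skipping any channel marked as a zombie. Actual deletion is left to the
// periodic cleaner, which takes the write lock.
func (c *cache) getChannels(node [33]byte) []*directedChannel {
	c.mtx.RLock()
	defer c.mtx.RUnlock()

	channels, ok := c.nodeChannels[node]
	if !ok {
		return nil
	}

	channelsCopy := make([]*directedChannel, 0, len(channels))
	for cid, channel := range channels {
		if _, ok := c.zombieIndex[cid]; ok {
			// Zombie: pretend it is already gone so pathfinding
			// never sees it.
			continue
		}
		channelsCopy = append(channelsCopy, channel)
	}

	return channelsCopy
}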

Fix a bug that leaks zombie channels in the memory graph, resulting in
incorrect pathfinding and unnecessary memory usage.
@GustavoStingelin GustavoStingelin force-pushed the graph-cache/zombie-channels branch from 46d2623 to 4f05f0d on July 17, 2025 19:18