Skip to content

Conversation

@ARR4N
Copy link
Contributor

@ARR4N ARR4N commented Nov 19, 2025

If snapshot.diskLayer.generate() completes early enough it set its genMarker field to nil and races with the diskLayer.stopGeneration() method, possibly leaking the generate() goroutine as it never receives on the genAbort channel. In practice this only means that Disable() would leak the goroutine, and that any test with goleak.Verify*() needs to ignore it. A cursory read suggests that Journal() will successfully release resources.

I've so far only demonstrated this in the test but not fixed it. Before doing so, I wanted to see if anyone thought it was worth it. For me it's only a minor annoyance in always having to ignore the method when checking for my own leaks.

@ARR4N ARR4N changed the title fix(snapshot): race resulting in generate() goroutine leak snapshot: race resulting in generate() goroutine leak Nov 19, 2025
@ARR4N ARR4N changed the title snapshot: race resulting in generate() goroutine leak core/state/snapshot: race resulting in generate() goroutine leak Nov 19, 2025
@ARR4N
Copy link
Contributor Author

ARR4N commented Nov 19, 2025

@rjl493456442 I'm gonna leave this in draft as it's only the test demonstration for now; please let me know if you think it's worth addressing.

@rjl493456442
Copy link
Member

rjl493456442 commented Nov 24, 2025

Hi @ARR4N

I still don't get the issue around.

If snapshot.diskLayer.generate() completes early enough it set its genMarker field to nil

Yes, it's true. We have a dedicated code waiting this event, making sure the test code will
be blocked before the generation is finished. Why do you think it's a data race?

	select {
	case <-snap.genPending:
		// Snapshot generation succeeded

	case <-time.After(3 * time.Second):
		t.Errorf("Snapshot generation failed")
	}
	checkSnapRoot(t, snap, root)

Only nitpick is probably we can change the t.Errorf("Snapshot generation failed") to t.Fatalf("Snapshot generation failed") to terminate the remaining execution.

@ARR4N
Copy link
Contributor Author

ARR4N commented Nov 24, 2025

Yes, it's true. We have a dedicated code waiting this event, making sure the test code will be blocked before the generation is finished. Why do you think it's a data race?

This only works within the snapshot package's tests, but doesn't account for external usage. I encountered it because of the Disable() method's comment:

Disable interrupts any pending snapshot generator, ...

In hindsight I see that it does say "pending", but the integration between Disable() and diskLayer.generate() does seem like Disable() was intended1 to clear that goroutine.

Footnotes

  1. This assumption being why I'm asking before addressing it.

@rjl493456442
Copy link
Member

Something to be aware of is that we plan to deprecate this snapshot soon. It's only needed for the legacy hash mode, which is required by the hash-mode archive node.

Once we roll out the new archive node based on path mode, and the migration is complete, both the snapshot and the hash-mode database will be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants