test: when content still exist, prune again #3802

Open · wants to merge 1 commit into master from contrib1

Conversation

amurzeau (Contributor)

Hi,

When tests run on GitHub Actions, some of them occasionally fail with `content still exists`, as reported in #3401.

Sometimes, a saved trace is still in the content store and is not removed.
Calling prune a second time when the content store is not empty seems to fix the issue.
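For illustration, the change is roughly equivalent to the sketch below (not the actual test code; `prune` and `listContent` are hypothetical stand-ins for the `Prune` call and the content-store check done by `checkAllReleasable`):

```go
package prunetest

import (
	"context"
	"fmt"
)

// pruneAgainIfNeeded prunes, and if the content store still holds items,
// prunes a second time before reporting an error. prune and listContent
// are hypothetical stand-ins for the real test helpers.
func pruneAgainIfNeeded(ctx context.Context,
	prune func(context.Context) error,
	listContent func(context.Context) ([]string, error)) error {

	if err := prune(ctx); err != nil {
		return err
	}
	left, err := listContent(ctx)
	if err != nil {
		return err
	}
	if len(left) == 0 {
		return nil
	}
	// A history trace blob may have been written after the first prune
	// completed; prune again so it gets removed too.
	if err := prune(ctx); err != nil {
		return err
	}
	if left, err = listContent(ctx); err != nil {
		return err
	}
	if len(left) != 0 {
		return fmt.Errorf("content still exists after second prune: %v", left)
	}
	return nil
}
```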

@tonistiigi (Member) left a comment


I guess this is fine as a hotfix, but it can hide an actual error. The current flaky error in CI happens because the HistoryAPI and the test framework have a sync conflict, which should still be fixed with something that addresses only that specific issue.

@amurzeau (Contributor, Author) commented Apr 21, 2023

Indeed, it seems the sequence of events is this (maybe my vocabulary isn't very good here :-) ; a toy sketch of the timing follows below):

  • a test is run and generates objects in the content store
  • the recordBuildHistory() function is run at the end of the test on the buildkitd side
  • recordBuildHistory starts an async goroutine without waiting for its completion
  • this goroutine starts with a 3s sleep
  • on the client_test side, checkAllReleasable is called
    • it calls ListenBuildHistory and waits for all events to be processed
    • it checks that DiskUsage returns only unused items (InUse = false)
    • it runs Prune()
  • on the buildkitd side, the prune is run
  • but later (~2.5s after the start of Prune()), the sleeping goroutine wakes up after its 3s sleep and saves an application/vnd.buildkit.otlp.json.v0 history blob in the content store
  • on the client_test side, the Prune ends (it took 3.3s)
    • but the content store check finds the new blob, which won't be pruned as it was added after the Prune() call

See more log details here: https://github.com/amurzeau/buildkit/actions/runs/4769578985/jobs/8480092417
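A toy reproduction of this timing (plain Go, not buildkit code; the durations mirror the 3s sleep and ~3.3s prune seen in the logs):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	store := map[string]bool{"build-result": true} // content created by the test

	// buildkitd side: recordBuildHistory starts an async goroutine that is
	// never awaited; it sleeps 3s before writing the trace blob.
	go func() {
		time.Sleep(3 * time.Second)
		mu.Lock()
		store["history-trace"] = true // application/vnd.buildkit.otlp.json.v0
		mu.Unlock()
	}()

	// client_test side: Prune removes everything that exists right now...
	start := time.Now()
	mu.Lock()
	for k := range store {
		delete(store, k)
	}
	mu.Unlock()
	// ...but the whole Prune call takes ~3.3s, so the goroutine's write at
	// t=3s lands before Prune returns and is never considered for removal.
	time.Sleep(3300*time.Millisecond - time.Since(start))

	// content store check: the late trace blob is still there.
	mu.Lock()
	fmt.Printf("content left after prune: %d\n", len(store)) // prints 1
	mu.Unlock()
}
```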

I'm not sure how this remaining blob could be ignored by the tests; maybe there is a way to get metadata associated with it, like the application/vnd.buildkit.otlp.json.v0 media type?
Or maybe keep listening to history events until the end of the checkAllReleasable function and, in the content store check loop, ignore everything matching a received event?
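A rough sketch of that second idea: collect the digests referenced by history events as they arrive and skip them in the content-store check loop. `historyEvents` and `walkContent` below are assumptions, not the actual buildkit test API.

```go
package prunetest

import "context"

// contentLeftIgnoringHistory walks the content store and returns only the
// digests that are not referenced by any history event received so far.
func contentLeftIgnoringHistory(ctx context.Context,
	historyEvents <-chan string, // digests referenced by history records
	walkContent func(context.Context, func(digest string) error) error) ([]string, error) {

	ignore := map[string]struct{}{}
drain:
	for {
		select {
		case d, ok := <-historyEvents:
			if !ok {
				break drain // event stream closed
			}
			ignore[d] = struct{}{}
		default:
			break drain // no more buffered events
		}
	}

	var left []string
	err := walkContent(ctx, func(d string) error {
		if _, ok := ignore[d]; ok {
			return nil // blob belongs to a history record: ignore it
		}
		left = append(left, d)
		return nil
	})
	return left, err
}
```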

@amurzeau (Contributor, Author) commented May 1, 2023

After digging more, I found that:

Most of the time, the application/vnd.buildkit.otlp.json.v0 blob is saved after the Prune operation has completed.

  • when the Prune operation is done, nothing is left and the checkAllReleasable function won't fail
  • then, after the 3s timer, when saving the trace, an error occurs because the ref is no longer available (I guess it was deleted by the Prune during the 3s wait)
  • when the lease is deleted, containerd triggers a garbage collection
  • the garbage collection removes the created content object as it is not referenced
  • so even if checkAllReleasable did not wait for it, it is still removed

But sometimes the trace is saved during a garbage collection; this is what happens (see https://github.com/amurzeau/buildkit/actions/runs/4854419604/jobs/8651910873):

  • Prune is called and triggers a garbage collection on the containerd side:
    2023-05-01T20:20:53.8416441Z sandbox.go:281: time="2023-05-01T20:13:20.68636493Z" level=debug msg="contentstore: prunning" spanID=7babd39e99bea6c9 traceID=f8bfbefbc8e49e6ccad0a6e8225c7603

  • while the garbage collection is running, the trace fails to be saved as the ref is already deleted (failed to save trace)
    2023-05-01T20:20:53.8460445Z sandbox.go:281: time="2023-05-01T20:13:20.71822571Z" level=debug msg="contentstore: saving trace for 0yk3s9htik3suzqii6vg69cbz: sha256:cfcf37dd0da8371f314a7fb42353417e55b9c0fa9d8ff6744abcfb256ccc0936"
    2023-05-01T20:20:53.8470152Z sandbox.go:281: time="2023-05-01T20:13:20.71943042Z" level=error msg="failed to save trace for 0yk3s9htik3suzqii6vg69cbz: file does not exist\ngithub.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.8.1\n\t/src/solver/llbsolver/solver.go:396\ngithub.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.8\n\t/src/solver/llbsolver/solver.go:399\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"

  • this triggers a garbage collection
    2023-05-01T20:20:53.8461058Z sandbox.go:281: time="2023-05-01T20:13:20.717223801Z" level=debug msg="garbage collection trigger: {ts:{wall:13910534272206231409 ext:9606598204 loc:0x55cd3dc9c200} mutation:true dirty:false}"

  • the ongoing garbage collection finishes and the timestamp of the last garbage collection is updated (https://github.com/amurzeau/containerd/blob/a7339a172d77afd2668070a91ddc7acd719e834d/gc/scheduler/scheduler.go#L314-L318)
    2023-05-01T20:20:53.8484050Z sandbox.go:281: time="2023-05-01T20:13:20.726406581Z" level=debug msg="contentstore: done garbage collection, last updated"

  • GC triggers are processed but ignored as their timestamp is before the last GC (https://github.com/amurzeau/containerd/blob/a7339a172d77afd2668070a91ddc7acd719e834d/gc/scheduler/scheduler.go#L284)
    2023-05-01T20:20:53.8486082Z sandbox.go:281: time="2023-05-01T20:13:20.726452282Z" level=debug msg="garbage collection trigger ignored: {ts:{wall:13910534272206231409 ext:9606598204 loc:0x55cd3dc9c200} mutation:true dirty:false}"

So the remaining trace in the content store is not removed: since nothing more is done by the test, no further garbage collection will be triggered on the containerd side.

@amurzeau (Contributor, Author) commented May 1, 2023

Do you want me to keep this solution open? Otherwise I can close it.
Maybe triggering a garbage collection directly on the containerd side when content still exists in the content store would be better? This should also fix the issue.
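If that route were taken, one possible way to force another GC pass from a containerd client is to create a throwaway lease and delete it with the SynchronousDelete option, which makes containerd wait for a garbage collection before returning. A sketch, assuming the test has direct access to the containerd socket (socket path and namespace below are placeholders):

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/leases"
	"github.com/containerd/containerd/namespaces"
)

// forceContainerdGC creates a throwaway lease and deletes it synchronously:
// the deletion marks the metadata store dirty, and SynchronousDelete blocks
// until a garbage collection pass has run.
func forceContainerdGC(ctx context.Context, address, namespace string) error {
	client, err := containerd.New(address)
	if err != nil {
		return err
	}
	defer client.Close()

	ctx = namespaces.WithNamespace(ctx, namespace)
	lm := client.LeasesService()

	l, err := lm.Create(ctx, leases.WithRandomID())
	if err != nil {
		return err
	}
	return lm.Delete(ctx, l, leases.SynchronousDelete)
}

func main() {
	// Placeholder socket path and namespace; adjust to the test sandbox.
	if err := forceContainerdGC(context.Background(),
		"/run/containerd/containerd.sock", "buildkit"); err != nil {
		log.Fatal(err)
	}
}
```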

@amurzeau force-pushed the contrib1 branch 2 times, most recently from 1857ff8 to 32a178f on May 2, 2023 at 12:48
Sometimes, a saved trace is still in the content store and is not
removed by the garbage collection call due to timing issues (a history trace
is saved in containerd while it is running a garbage collection).

When this happens, trigger a garbage collection again so the remaining
content is removed.

Signed-off-by: Alexis Murzeau <[email protected]>