Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rare deadlock in GraphNode.GetOrComputeValue #18115

Open
majocha opened this issue Dec 7, 2024 · 3 comments · May be fixed by #18178
Open

Rare deadlock in GraphNode.GetOrComputeValue #18115

majocha opened this issue Dec 7, 2024 · 3 comments · May be fixed by #18178
Labels
Area-Compiler-Service Various compiler service issues which do not belong to other labels/areas. Bug Impact-Low (Internal MS Team use only) Describes an issue with limited impact on existing code.
Milestone

Comments

@majocha
Copy link
Contributor

majocha commented Dec 7, 2024

GraphNode.GetOrComputeValue can deadlock when the threadpool is under some pressure.

See: https://dev.azure.com/dnceng-public/public/_build/results?buildId=886415&view=logs&j=966a6946-91f0-5397-81f4-d998467880f9&t=344839a9-08c6-5829-65fb-26f32c5a016f&l=17626

Repro steps

This is reproducible in #17872 locally. For example:
in VS select all BuildGraphTests in Test Explorer and "run until failure".
The tests run in parallel and after few dozens or some more iterations the test case Many requests to get a value asynchronously might evaluate the computation more than once even when some requests get canceled will get stuck here

@github-actions github-actions bot added this to the Backlog milestone Dec 7, 2024
@T-Gro T-Gro added Impact-Low (Internal MS Team use only) Describes an issue with limited impact on existing code. and removed Needs-Triage labels Dec 9, 2024
@T-Gro
Copy link
Member

T-Gro commented Dec 9, 2024

Thank you Jakub for finding this.
Is it right to assume you only spotted this with parallel tests?

@T-Gro T-Gro added the Area-Compiler-Service Various compiler service issues which do not belong to other labels/areas. label Dec 9, 2024
@majocha
Copy link
Contributor Author

majocha commented Dec 9, 2024

Thank you Jakub for finding this. Is it right to assume you only spotted this with parallel tests?

Yes, it showed up once in CI recently. I confirmed a repro locally but also only in a parallel test run.

@majocha
Copy link
Contributor Author

majocha commented Dec 28, 2024

The issue is subtle. We attach a continuation that sets the taken flag:

semaphore
.WaitAsync(ct)
.ContinueWith(
(fun _ -> taken <- true),
(TaskContinuationOptions.NotOnCanceled
||| TaskContinuationOptions.NotOnFaulted
||| TaskContinuationOptions.ExecuteSynchronously)
)
|> Async.AwaitTask

ExecuteSynchronously is there to ensure it is always executed before we get to this finally block:

finally
if taken then
semaphore.Release() |> ignore

But SemaphoreSlim enforces asynchronous continuations, overriding ExecuteSynchronously.

In effect, on rare occasions when the cancellation coincides with taking the samaphore, the finally block is executed before setting the flag. As a result, the semaphore is never released.

@majocha majocha linked a pull request Dec 28, 2024 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Compiler-Service Various compiler service issues which do not belong to other labels/areas. Bug Impact-Low (Internal MS Team use only) Describes an issue with limited impact on existing code.
Projects
Status: New
Development

Successfully merging a pull request may close this issue.

2 participants