-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition around supervisor's Commander #291
Open
tpaschalis
wants to merge
5
commits into
open-telemetry:main
Choose a base branch
from
tpaschalis:fix-testnewcommander-race
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+70
−10
Open
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
d191a3b
Fix race condition around supervisor's Commander
tpaschalis b45a84f
Reuse new condition
tpaschalis 1337ec4
Synchronize stopping explicitly
tpaschalis f1840f7
Set IsStopping flag first things first
tpaschalis f135e3b
Add test for verifying proper shutdown
tpaschalis File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I've actually flipped the condition from what it was before, as per the method's docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this potentially leave the agent process running in some cases? The NewSupervisor method starts a goroutine
runAgentProcess
to launch the agent process. There's a possibility thatShutdown
might be called beforec.cmd.Start()
finishes. Ifc.cmd.Start()
completes successfully, the process would keep running, which isn't expected.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, let me think about this (and if the previous code prevented that from happening). We could potentially store the running value a little earlier to avoid that IIUC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there, sorry for taking so long to get back to you. As far as I can tell from the code, this was already the case and it isn't a regression from this PR.
To avoid this, we'd make it the caller's responsibility to call only call Stop() after IsRunning() returns true (for example having a backoff mechanism for properly cleaning up resources), or add more explicit synchronization like we do in clientcommon
opamp-go/client/internal/clientcommon.go
Lines 53 to 55 in ed38d5f
IMO this is a separate issue than the race condition the one's fixed here, so tackling the cleanup of resoueces might be a new discussion. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think it was possible to leave the process running in the background because we are checking if
c.cmd
is non-nil (which is the source of the race condition) instead of running status. I think theStop
should be updated to handle this.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay.
Here is the case I think this still doesn't address
Start()
is calledc.running.Store(true)
is executedShutdown()
is called during this window!c.IsRunning()
evaluates to true at this point,Shutdown()
returns immediatelyDoes that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I've pushed another commit in the Stop method to set the isStoppingFlag immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, let me try to get this right.
The commander's Start is called only in two cases
a) In the Supervisor's startAgent.
b) During a restart (not pertinent to our discussion)
The startAgent method is only called via the runAgentProcess; in turn, this is only called as a new goroutine in the NewSupervisor function.
In the Shutdown method, the Commander's Stop sets the isStopping flag.
So let's say that NewSupervisor starts, and the new goroutine is launched.
Immediately, the commander's Shutdown method is called, to call the commander's Stop and set the isStoppingFlag. When the call stack gets to the commander's Start it will immediately exit with the IsStopping method without having a chance to launch any process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @srikanthccv I've put out a test in f135e3b to try and verify the behavior we want.
I couldn't find any more elegant way than the ugly DelayLogger, but I'd like to see what you think. (This test fails in main, without my fix here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take this case
NewSupervisor starts, and the goroutine
runAgentProcess
is launched, ->startAgent
->s.commander.Start
-> the execution reachesc.cmd.Start
opamp-go/internal/examples/supervisor/supervisor/commander/commander.go
Line 61 in 7cdd395
And say
cmd.Start
didn't return yetNow, assume the commander's Shutdown method is called (
c.cmd.Start
still didn't return) soc.running
is false. In this flowisStoppingFlag
is not helping because we would still return asIsRunning
evaluates to true.I think if we add a waitgroup before the
c.cmd.Start
and wait in shutdown should solve the issue.