Skip to content

Conversation

@jglick
Copy link
Member

@jglick jglick commented Aug 18, 2025

There seems to be a race condition with ContinuedTask: its effectiveness relies on the PlaceholderTask that was continued having actually been scheduled (and not merely in pendings but in buildables) by the time other tasks are added to the queue. That is not a safe assumption at least since jenkinsci/workflow-api-plugin#221 (and possibly even before that), and certainly as of jenkinsci/workflow-api-plugin#368 this is not a good basis for blocking queue items. I think it is necessary to somehow mark the Slave prior to shutdown as having been used for a given build, and wait for it to be ready before considering scheduling anything else onto the agent. (But not using anything in config.xml, since that would be clobbered by JCasC in the case of permanent agents.) Perhaps there is some simpler criterion that can be used.

CloudBees-internal reference

@jglick jglick added the bug label Aug 18, 2025
var tardyB = j.jenkins.getItemByFullName("tardy", WorkflowJob.class).getBuildByNumber(1);
j.waitForMessage("Ready to run", tardyB);
SemaphoreStep.success("tardy/1", null);
j.assertBuildStatusSuccess(j.waitForCompletion(tardyB));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java.lang.AssertionError: 
unexpected build status; build log was:
------
Started
[Pipeline] Start of Pipeline
[Pipeline] slowToResume
[Pipeline] {
[Pipeline] node
Running on remote in …/agent-work-dirs/remote/workspace/tardy
[Pipeline] {
[Pipeline] semaphore
Resuming build at Mon Aug 18 19:22:29 EDT 2025 after Jenkins restart
Will resume outer step…
…resumed.
Waiting for reconnection of remote before proceeding with build
remote has been removed for 15 sec; assuming it is not coming back, and terminating node step
Ready to run at Mon Aug 18 19:22:47 EDT 2025
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // slowToResume
[Pipeline] End of Pipeline
Timeout waiting for agent to come back
org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 03aac892-2c78-4c34-b72f-f4fc12d3c117
Finished: ABORTED

------

Expected: is <SUCCESS>
     but: was <ABORTED>

@Override public void onResume() {
try {
getContext().get(TaskListener.class).getLogger().println("Will resume outer step…");
Thread.sleep(3_000);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passes without this delay.

@jglick jglick requested a review from dwnusbaum August 18, 2025 23:26
@dwnusbaum
Copy link
Member

dwnusbaum commented Aug 19, 2025

I think it is necessary to somehow mark the Slave prior to shutdown as having been used for a given build, and wait for it to be ready before considering scheduling anything else onto the agent. (But not using anything in config.xml, since that would be clobbered by JCasC in the case of permanent agents.) Perhaps there is some simpler criterion that can be used.

Maybe a QueueTaskDispatcher that stores persistent state about agents being used by node steps could help, since that would be able to take effect prior to the step resuming?

@jglick
Copy link
Member Author

jglick commented Aug 19, 2025

a QueueTaskDispatcher that stores persistent state about agents

I do not want to introduce any global state. Only things that can be stored in build and/or agent root dirs.

@jglick jglick changed the title Demonstrating race condition with ContinuedTask ContinuedTask cannot rely on queue scheduling order Aug 19, 2025
@jglick
Copy link
Member Author

jglick commented Aug 19, 2025

Related: #382

@jglick
Copy link
Member Author

jglick commented Aug 20, 2025

This approach seems to work, though I do not love the idea of writing config.xml for a permanent agent every time a node block starts or stops. At this point the bug in OSS remains theoretical (only observed under heavy load in a CloudBees CI HA controller). I may park this.

@jglick jglick closed this Aug 20, 2025
@jglick jglick deleted the ContinuedTask branch August 20, 2025 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants