`ContinuedTask` cannot rely on queue scheduling order #472

jglick · 2025-08-18T23:24:40Z

There seems to be a race condition with `ContinuedTask`: its effectiveness relies on the continued `PlaceholderTask` having actually been scheduled (and not merely in pendings but in buildables) by the time other tasks are added to the queue. That has not been a safe assumption at least since jenkinsci/workflow-api-plugin#221 (and possibly even before that), and certainly as of jenkinsci/workflow-api-plugin#368 it is not a good basis for blocking queue items. I think it is necessary to somehow mark the `Slave` prior to shutdown as having been used for a given build, and to wait for it to be ready before considering scheduling anything else onto the agent. ~~(But not using anything in `config.xml`, since that would be clobbered by JCasC in the case of permanent agents.) Perhaps there is some simpler criterion that can be used.~~

CloudBees-internal reference
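
To make the ordering problem concrete, here is a deliberately simplified toy model in plain Java. It is not the Jenkins `Queue` implementation: the class `ContinuedTaskRaceSketch`, its `Item` type, and the `isBlocked` rule are all hypothetical, and they only assume the behavior described in the comment above, namely that an ordinary item is held back only when a continued item for the same agent is already buildable.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT the Jenkins API) of blocking based on what is currently buildable. */
public class ContinuedTaskRaceSketch {

    /** Hypothetical stand-in for a queue item; `continued` mimics a continued task. */
    public static final class Item {
        public final String name;
        public final String agent;
        public final boolean continued;

        public Item(String name, String agent, boolean continued) {
            this.name = name;
            this.agent = agent;
            this.continued = continued;
        }
    }

    /** Items that have reached the "buildable" stage of the toy queue. */
    private final List<Item> buildables = new ArrayList<>();

    public void addBuildable(Item item) {
        buildables.add(item);
    }

    /**
     * An ordinary item is held back only if a continued item for the same agent
     * is already buildable. Nothing holds it back if that item has not arrived yet.
     */
    public boolean isBlocked(Item candidate) {
        if (candidate.continued) {
            return false;
        }
        return buildables.stream()
                .anyMatch(i -> i.continued && i.agent.equals(candidate.agent));
    }

    public static void main(String[] args) {
        Item resumed = new Item("tardy #1 (resuming)", "remote", true);
        Item newcomer = new Item("other #1", "remote", false);

        // Expected ordering: the resumed build is buildable first, so the newcomer waits.
        ContinuedTaskRaceSketch inOrder = new ContinuedTaskRaceSketch();
        inOrder.addBuildable(resumed);
        System.out.println("in order, newcomer blocked: " + inOrder.isBlocked(newcomer)); // true

        // Racy ordering: the resumed build is still resuming, so nothing blocks the newcomer.
        ContinuedTaskRaceSketch racy = new ContinuedTaskRaceSketch();
        System.out.println("racy, newcomer blocked: " + racy.isBlocked(newcomer)); // false
    }
}
```

In the second case the agent can be handed to the newcomer before the resumed build ever asks for it, which is the kind of outcome the test in this pull request provokes by delaying `onResume`.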

jglick · 2025-08-18T23:26:20Z

src/test/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepDynamicContextTest.java

+            var tardyB = j.jenkins.getItemByFullName("tardy", WorkflowJob.class).getBuildByNumber(1);
+            j.waitForMessage("Ready to run", tardyB);
+            SemaphoreStep.success("tardy/1", null);
+            j.assertBuildStatusSuccess(j.waitForCompletion(tardyB));


java.lang.AssertionError: unexpected build status; build log was:
------
Started
[Pipeline] Start of Pipeline
[Pipeline] slowToResume
[Pipeline] {
[Pipeline] node
Running on remote in …/agent-work-dirs/remote/workspace/tardy
[Pipeline] {
[Pipeline] semaphore
Resuming build at Mon Aug 18 19:22:29 EDT 2025 after Jenkins restart
Will resume outer step…
…resumed.
Waiting for reconnection of remote before proceeding with build
remote has been removed for 15 sec; assuming it is not coming back, and terminating node step
Ready to run at Mon Aug 18 19:22:47 EDT 2025
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // slowToResume
[Pipeline] End of Pipeline
Timeout waiting for agent to come back
org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 03aac892-2c78-4c34-b72f-f4fc12d3c117
Finished: ABORTED
------
Expected: is <SUCCESS>
     but: was <ABORTED>

jglick · 2025-08-18T23:26:32Z

src/test/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepDynamicContextTest.java

+            @Override public void onResume() {
+                try {
+                    getContext().get(TaskListener.class).getLogger().println("Will resume outer step…");
+                    Thread.sleep(3_000);


The test passes without this delay.

dwnusbaum · 2025-08-19T16:58:11Z

> I think it is necessary to somehow mark the Slave prior to shutdown as having been used for a given build, and wait for it to be ready before considering scheduling anything else onto the agent. (But not using anything in config.xml, since that would be clobbered by JCasC in the case of permanent agents.) Perhaps there is some simpler criterion that can be used.

Maybe a QueueTaskDispatcher that stores persistent state about agents being used by node steps could help, since that would be able to take effect prior to the step resuming?

jglick · 2025-08-19T19:38:56Z

> a QueueTaskDispatcher that stores persistent state about agents

I do not want to introduce any global state. Only things that can be stored in build and/or agent root dirs.
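
As an illustration of that constraint only, the following plain-Java sketch keeps the "agent is in use by build X" fact in a small file under a per-agent directory rather than in any global structure. Everything here is an assumption made for the example: the class `AgentReservationSketch`, the marker file name, the build identifier format, and the reading of "agent root dir" as a directory that survives a controller restart. It is not the approach taken in this pull request, which (per a later comment) wrote `config.xml`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical sketch: remember which build holds an agent via a file in the agent's root dir. */
public class AgentReservationSketch {

    /** Hypothetical marker file name; not something any Jenkins plugin is known to write. */
    static final String MARKER = "reserved-by-build.txt";

    /** Would be called when a node block starts; records the owning build, e.g. "tardy#1". */
    public static void reserve(Path agentRootDir, String buildId) {
        try {
            Files.createDirectories(agentRootDir);
            Files.writeString(agentRootDir.resolve(MARKER), buildId, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Would be called when the node block ends. */
    public static void release(Path agentRootDir) {
        try {
            Files.deleteIfExists(agentRootDir.resolve(MARKER));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the build recorded in the marker file, if any. */
    public static Optional<String> reservedBy(Path agentRootDir) {
        Path marker = agentRootDir.resolve(MARKER);
        if (!Files.exists(marker)) {
            return Optional.empty();
        }
        try {
            return Optional.of(Files.readString(marker, StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A build may use the agent if nobody reserved it, or if it is the reserving build. */
    public static boolean mayUse(Path agentRootDir, String buildId) {
        return reservedBy(agentRootDir).map(buildId::equals).orElse(true);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("agent-root");
        reserve(root, "tardy#1");
        System.out.println(mayUse(root, "other#1")); // false: held for tardy#1
        System.out.println(mayUse(root, "tardy#1")); // true: the owner may come back
        release(root);
        System.out.println(mayUse(root, "other#1")); // true: nobody holds the agent
    }
}
```

Because the marker lives on disk next to the agent, the check does not depend on when the resuming build reaches the queue, at the cost of a file write each time a node block starts or stops.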

jglick · 2025-08-19T22:44:46Z

Related: #382

jglick · 2025-08-20T11:23:54Z

This approach seems to work, though I do not love the idea of writing config.xml for a permanent agent every time a node block starts or stops. At this point the bug in OSS remains theoretical (only observed under heavy load in a CloudBees CI HA controller). I may park this.

Demonstrating race condition with ContinuedTask

22de9fc

jglick added the bug label Aug 18, 2025

jglick requested a review from dwnusbaum August 18, 2025 23:26

Sketch of a working fix

9aa08bd

jglick changed the title ~~Demonstrating race condition with ContinuedTask~~ ContinuedTask cannot rely on queue scheduling order Aug 19, 2025

jglick closed this Aug 20, 2025

jglick deleted the ContinuedTask branch August 20, 2025 23:34
