Fix Reaper not detecting ImagePullBackOff from reason field #2785
Abhijeet212004 wants to merge 2 commits into jenkinsci:master
Conversation
The Reaper was only checking the `message` field for ImagePullBackOff errors, but Kubernetes actually sets the `reason` field. This caused pods to not get cleaned up when images failed to pull. The fix checks the `reason` field first, then falls back to the `message` field for backwards compatibility. Fixes jenkinsci#2772
```java
return waiting != null
        && waiting.getMessage() != null
        && waiting.getMessage().contains("Back-off pulling image");
```
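A minimal sketch of the fixed predicate described in this PR: check the `reason` field first, then fall back to the `message` substring match. The `Waiting` record and `isImagePullBackoff` helper here are hypothetical stand-ins for the plugin's actual fabric8 `ContainerStateWaiting` type and Reaper method, used only to illustrate the logic:

```java
public class ReaperCheckSketch {
    // Hypothetical stand-in for fabric8's ContainerStateWaiting (reason/message fields).
    record Waiting(String reason, String message) {}

    // Sketch of the fixed check: reason field first, message as backwards-compatible fallback.
    static boolean isImagePullBackoff(Waiting waiting) {
        if (waiting == null) {
            return false;
        }
        if ("ImagePullBackOff".equals(waiting.reason())) {
            return true; // primary Kubernetes indicator
        }
        // Fallback for environments that only populate the message field.
        return waiting.message() != null
                && waiting.message().contains("Back-off pulling image");
    }
}
```

The ordering matters: a reason-only event (null message) would short-circuit the old message check to false, whereas here it is caught before the message fallback is consulted.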
Thanks for the review @Vlatombe! Yes, I saw that #772 originally introduced this logic. The issue is that the code only checked the message field, but in some Kubernetes versions, ImagePullBackOff details appear in the reason field first. My fix checks both fields to ensure backward compatibility while catching all ImagePullBackOff scenarios.
Hi @jglick, could this PR get a review when you have a chance? We're hitting this bug regularly in production and it's causing significant impact.

Most recently, a single deployment build created 1,285 pods in a 2-minute retry loop before a human noticed and manually aborted it. Each pod times out after 120s waiting for Ready; Jenkins deletes it and immediately creates a new one, indefinitely. The Reaper never fires because it checks the message field instead of the reason field. We investigated this extensively 9 months ago and confirmed the Reaper's ImagePullBackOff detection is broken.

There's no pipeline-level workaround: the pod recreation happens inside the Kubernetes plugin itself, below what Jenkinsfile retry() or timeout() can control.

The fix in this PR is small and correct: check reason first, fall back to message. It already has an approval from @Vlatombe. Would be great to get this merged and into a release.
I tested this PR on our Jenkins instance and found that the ImagePullBackOff fix alone did not stop the pod looping issue. The Reaper was not acting on the ImagePullBackOff events despite detecting them.

Root cause: this PR branch was created before #2801 (revert of "Close Kubernetes client on cache eviction") was merged to master. Since the branch includes the original client.close() change from #2788, the Kubernetes client cache eviction closes the client after ~10 minutes, which kills the Reaper's pod watch. Without the watch, the Reaper never receives pod events and cannot terminate pods in ImagePullBackOff state.

After rebasing this PR on latest master (which includes the #2801 revert), the fix works as expected: the pod is terminated after 3 backoff events, the queue item is cancelled, and the build aborts cleanly, with no infinite pod loop.

This PR needs a rebase on latest master before merging.
Testing done
- Added `testTerminateAgentOnImagePullBackoffReasonFieldOnly()`, which reproduces the exact scenario from issue "Reaper not terminating pods in ImagePullBackOff state" #2772, where only the `reason` field is set to "ImagePullBackOff" (with null message)
- Ran `mvn clean verify -DskipTests` to ensure code compiles and passes all quality checks (Spotless, SpotBugs)
- Detection now checks the `reason` field (primary Kubernetes indicator)
- Falls back to the `message` field if `reason` is not set

Submitter checklist
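To make the reason-field-only scenario concrete, here is a self-contained sketch contrasting the pre-fix predicate (message only) with the post-fix one (reason first). The `Waiting` record and both method names are hypothetical illustrations, not the plugin's real API:

```java
public class BackoffScenarioSketch {
    // Hypothetical stand-in for the container waiting state.
    record Waiting(String reason, String message) {}

    // Pre-fix predicate: message field only, so reason-only events are missed.
    static boolean oldCheck(Waiting w) {
        return w != null
                && w.message() != null
                && w.message().contains("Back-off pulling image");
    }

    // Post-fix predicate: reason first, then message fallback.
    static boolean newCheck(Waiting w) {
        if (w == null) {
            return false;
        }
        if ("ImagePullBackOff".equals(w.reason())) {
            return true;
        }
        return w.message() != null && w.message().contains("Back-off pulling image");
    }

    public static void main(String[] args) {
        // Scenario from #2772: only reason is set, message is null.
        Waiting reasonOnly = new Waiting("ImagePullBackOff", null);
        System.out.println(oldCheck(reasonOnly)); // false: old code never fires
        System.out.println(newCheck(reasonOnly)); // true: fix detects it
    }
}
```

This is the case where the old code silently returned false and the pod loop described above could begin.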