Fix Reaper not detecting ImagePullBackOff from reason field#2785

Open
Abhijeet212004 wants to merge 2 commits into jenkinsci:master from Abhijeet212004:fix/reaper-imagepullbackoff-2772

Conversation

@Abhijeet212004

The Reaper was only checking the message field for ImagePullBackOff errors, but Kubernetes actually sets the reason field. This caused pods to not get cleaned up when images failed to pull.

The fix now checks the reason field first, then falls back to the message field for backwards compatibility.

Fixes #2772
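A minimal sketch of the corrected check, using a local WaitingState stand-in in place of fabric8's ContainerStateWaiting; class and method names here are illustrative, not the plugin's actual code.

```java
// Sketch of the fixed Reaper check. WaitingState is a stand-in for
// io.fabric8.kubernetes.api.model.ContainerStateWaiting; the real
// plugin code differs in naming and context.
public class ReaperCheckSketch {

    // Hypothetical stand-in for ContainerStateWaiting
    static class WaitingState {
        private final String reason;
        private final String message;

        WaitingState(String reason, String message) {
            this.reason = reason;
            this.message = message;
        }

        String getReason() { return reason; }
        String getMessage() { return message; }
    }

    static boolean isImagePullBackoff(WaitingState waiting) {
        if (waiting == null) {
            return false;
        }
        // Primary indicator: Kubernetes sets reason=ImagePullBackOff
        if ("ImagePullBackOff".equals(waiting.getReason())) {
            return true;
        }
        // Backward compatibility: the original message-based check
        return waiting.getMessage() != null
                && waiting.getMessage().contains("Back-off pulling image");
    }

    public static void main(String[] args) {
        // The #2772 scenario: reason set, message null
        System.out.println(isImagePullBackoff(
                new WaitingState("ImagePullBackOff", null))); // true
        // Legacy scenario: only the message is set
        System.out.println(isImagePullBackoff(
                new WaitingState(null, "Back-off pulling image \"busybox\""))); // true
    }
}
```

The reason check comes first because it is the field Kubernetes reliably populates; the message check remains so clusters that only surface the back-off text keep working.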

Testing done

  • Added a new test case testTerminateAgentOnImagePullBackoffReasonFieldOnly() that reproduces the exact scenario from issue #2772 ("Reaper not terminating pods in ImagePullBackOff state"), where only the reason field is set to "ImagePullBackOff" and the message field is null
  • Verified the test passes with the fix
  • Ran mvn clean verify -DskipTests to ensure code compiles and passes all quality checks (Spotless, SpotBugs)
  • Manually verified the logic handles both scenarios:
    • New behavior: Detects ImagePullBackOff from reason field (primary Kubernetes indicator)
    • Backward compatibility: Still detects from message field if reason is not set

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@Abhijeet212004 Abhijeet212004 requested a review from a team as a code owner January 4, 2026 18:27
@Abhijeet212004
Author

Hi @Vlatombe @jglick - I'd appreciate if you could review this PR when you have time. This fixes the Reaper not detecting ImagePullBackOff from the reason field (issue #2772).

I've added test coverage and verified the fix works while maintaining backward compatibility. Thanks!

Comment on lines -603 to -605
return waiting != null
&& waiting.getMessage() != null
&& waiting.getMessage().contains("Back-off pulling image");
Member


Introduced originally in #772

Author


Thanks for the review @Vlatombe! Yes, I saw that #772 originally introduced this logic. The issue is that the code only checked the message field, but in some Kubernetes versions, ImagePullBackOff details appear in the reason field first. My fix checks both fields to ensure backward compatibility while catching all ImagePullBackOff scenarios.

@ns-gsa

ns-gsa commented Mar 17, 2026

Hi @jglick — could this PR get a review when you have a chance? We're hitting this bug regularly in production and it's causing significant impact.

Most recently, a single deployment build created 1,285 pods in a 2-minute retry loop before a human noticed and manually aborted it. Each pod times out after 120s waiting for Ready, Jenkins deletes it and immediately creates a new one — indefinitely. The Reaper never fires because it's checking message instead of reason.

We investigated this extensively 9 months ago and confirmed the Reaper's ImagePullBackOff detection is broken. There's no pipeline-level workaround — the pod recreation happens inside the Kubernetes plugin itself, below what Jenkinsfile retry() or timeout() can control.

The fix in this PR is small and correct — check reason first, fall back to message. It already has an approval from @Vlatombe. Would be great to get this merged and into a release.

@ns-gsa

ns-gsa commented Mar 18, 2026


I tested this PR on our Jenkins instance and found that the ImagePullBackOff fix alone did not stop the pod looping issue. The Reaper was not acting on the ImagePullBackOff events despite detecting them.

Root cause: This PR branch was created before #2801 (revert of "Close Kubernetes client on cache eviction") was merged to master. Since the branch includes the original client.close() change from #2788, the Kubernetes client cache eviction closes the client after ~10 minutes, which kills the Reaper's pod watch. Without the watch, the Reaper never receives pod events and cannot terminate pods in ImagePullBackOff state.

After rebasing this PR on latest master (which includes the #2801 revert), the fix works as expected:

Container [app-config] waiting [ImagePullBackOff] Back-off pulling image "registry.example.com/example-image:x.y.z"
ERROR: Image pull backoff detected, waiting for image to be available. Will wait for 2 more events before terminating the node.
...
ERROR: Image pull backoff detected, waiting for image to be available. Will wait for 1 more events before terminating the node.
...
ERROR: Unable to pull container image "registry.example.com/example-image:x.y.z". Check if image tag name is spelled correctly.
Queue task was cancelled
Finished: ABORTED

The pod is terminated after 3 backoff events, the queue item is cancelled, and the build aborts cleanly — no infinite pod loop.

This PR needs a rebase on latest master before merging.


Labels

bug Bug Fixes


Development

Successfully merging this pull request may close these issues.

Reaper not terminating pods in ImagePullBackOff state

3 participants