Fix Reaper not detecting ImagePullBackOff from reason field #2785
Abhijeet212004 wants to merge 2 commits into jenkinsci:master
Conversation
The Reaper was only checking the `message` field for ImagePullBackOff errors, but Kubernetes actually sets the `reason` field. This caused pods to not get cleaned up when images failed to pull. The fix checks the `reason` field first, then falls back to the `message` field for backwards compatibility. Fixes jenkinsci#2772
```java
return waiting != null
        && waiting.getMessage() != null
        && waiting.getMessage().contains("Back-off pulling image");
```
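A minimal sketch of the fixed predicate described in this PR: check the `reason` field first, then fall back to the `message` substring match. The `Waiting` record and `isImagePullBackoff` helper here are hypothetical stand-ins for the plugin's actual fabric8 `ContainerStateWaiting` type and Reaper method, used only to illustrate the logic:

```java
public class ReaperCheckSketch {
    // Hypothetical stand-in for fabric8's ContainerStateWaiting (reason/message fields).
    record Waiting(String reason, String message) {}

    // Sketch of the fixed check: reason field first, message as backwards-compatible fallback.
    static boolean isImagePullBackoff(Waiting waiting) {
        if (waiting == null) {
            return false;
        }
        if ("ImagePullBackOff".equals(waiting.reason())) {
            return true; // primary Kubernetes indicator
        }
        // Fallback for environments that only populate the message field.
        return waiting.message() != null
                && waiting.message().contains("Back-off pulling image");
    }
}
```

The ordering matters: a reason-only event (null message) would short-circuit the old message check to false, whereas here it is caught before the message fallback is consulted.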
Thanks for the review @Vlatombe! Yes, I saw that #772 originally introduced this logic. The issue is that the code only checked the message field, but in some Kubernetes versions, ImagePullBackOff details appear in the reason field first. My fix checks both fields to ensure backward compatibility while catching all ImagePullBackOff scenarios.
Hi @jglick, could this PR get a review when you have a chance? We're hitting this bug regularly in production and it's causing significant impact.

Most recently, a single deployment build created 1,285 pods in a 2-minute retry loop before a human noticed and manually aborted it. Each pod times out after 120s waiting for Ready; Jenkins deletes it and immediately creates a new one, indefinitely. The Reaper never fires because it checks the message field instead of the reason field. We investigated this extensively 9 months ago and confirmed the Reaper's ImagePullBackOff detection is broken.

There's no pipeline-level workaround: the pod recreation happens inside the Kubernetes plugin itself, below what Jenkinsfile retry() or timeout() can control.

The fix in this PR is small and correct: check reason first, fall back to message. It already has an approval from @Vlatombe. Would be great to get this merged and into a release.
I tested this PR on our Jenkins instance and found that the ImagePullBackOff fix alone did not stop the pod looping issue. The Reaper was not acting on the ImagePullBackOff events despite detecting them.

Root cause: this PR branch was created before #2801 (revert of "Close Kubernetes client on cache eviction") was merged to master. Since the branch includes the original client.close() change from #2788, the Kubernetes client cache eviction closes the client after ~10 minutes, which kills the Reaper's pod watch. Without the watch, the Reaper never receives pod events and cannot terminate pods in ImagePullBackOff state.

After rebasing this PR on latest master (which includes the #2801 revert), the fix works as expected: the pod is terminated after 3 backoff events, the queue item is cancelled, and the build aborts cleanly, with no infinite pod loop.

This PR needs a rebase on latest master before merging.
Testing done
- Added `testTerminateAgentOnImagePullBackoffReasonFieldOnly()`, which reproduces the exact scenario from issue "Reaper not terminating pods in ImagePullBackOff state" #2772, where only the `reason` field is set to "ImagePullBackOff" (with null message)
- Ran `mvn clean verify -DskipTests` to ensure code compiles and passes all quality checks (Spotless, SpotBugs)
- Detection now checks the `reason` field (primary Kubernetes indicator)
- Falls back to the `message` field if `reason` is not set

Submitter checklist
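To make the reason-field-only scenario concrete, here is a self-contained sketch contrasting the pre-fix predicate (message only) with the post-fix one (reason first). The `Waiting` record and both method names are hypothetical illustrations, not the plugin's real API:

```java
public class BackoffScenarioSketch {
    // Hypothetical stand-in for the container waiting state.
    record Waiting(String reason, String message) {}

    // Pre-fix predicate: message field only, so reason-only events are missed.
    static boolean oldCheck(Waiting w) {
        return w != null
                && w.message() != null
                && w.message().contains("Back-off pulling image");
    }

    // Post-fix predicate: reason first, then message fallback.
    static boolean newCheck(Waiting w) {
        if (w == null) {
            return false;
        }
        if ("ImagePullBackOff".equals(w.reason())) {
            return true;
        }
        return w.message() != null && w.message().contains("Back-off pulling image");
    }

    public static void main(String[] args) {
        // Scenario from #2772: only reason is set, message is null.
        Waiting reasonOnly = new Waiting("ImagePullBackOff", null);
        System.out.println(oldCheck(reasonOnly)); // false: old code never fires
        System.out.println(newCheck(reasonOnly)); // true: fix detects it
    }
}
```

This is the case where the old code silently returned false and the pod loop described above could begin.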