
fix: stuck hook issue when a Job resource has a ttlSecondsAfterFinished field set #646

Open · wants to merge 2 commits into master from fix/job-ttl-stuck-hook
Conversation

dejanzele

@dejanzele dejanzele commented Dec 4, 2024

This is a proof-of-concept PR which would fix argoproj/argo-cd#21055

More info on the issue can be found in the linked GitHub issue.

The idea is to add a finalizer to Job resources that have ttlSecondsAfterFinished set and remove it after Argo CD detects that the hook has completed.

A simpler approach would be to unset ttlSecondsAfterFinished, but that would cause drift between the defined and actual state.

The proposed solution is to attach a finalizer to all hook tasks and remove it after Argo CD acknowledges that the hook task has completed in the sync phase.

The same scenario described in the linked GitHub issue passes with this PR.

I welcome any feedback, as I think a lot of people in the community would like this to be fixed, and I'd be more than happy to adapt the approach based on the best direction for this issue.
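The add-then-remove finalizer flow described above can be sketched as a small, dependency-free Go example. The finalizer name `argocd.argoproj.io/hook-finalizer` and the helper names are illustrative only, not necessarily what this PR uses:

```go
package main

import "fmt"

// hookFinalizer is an illustrative finalizer name; the actual name used in
// this PR may differ.
const hookFinalizer = "argocd.argoproj.io/hook-finalizer"

// addFinalizer returns the finalizer list with f appended if it is not
// already present, plus whether the list was mutated.
func addFinalizer(finalizers []string, f string) ([]string, bool) {
	for _, existing := range finalizers {
		if existing == f {
			return finalizers, false
		}
	}
	return append(finalizers, f), true
}

// removeFinalizer returns the finalizer list without f, plus whether the
// list was mutated.
func removeFinalizer(finalizers []string, f string) ([]string, bool) {
	out := finalizers[:0:0]
	mutated := false
	for _, existing := range finalizers {
		if existing == f {
			mutated = true
			continue
		}
		out = append(out, existing)
	}
	return out, mutated
}

func main() {
	// Sync phase starts: pin the hook object so a low ttlSecondsAfterFinished
	// cannot delete it before Argo CD observes its final state.
	finalizers, _ := addFinalizer(nil, hookFinalizer)
	fmt.Println(finalizers) // [argocd.argoproj.io/hook-finalizer]

	// Hook completion acknowledged: release the object for garbage collection.
	finalizers, _ = removeFinalizer(finalizers, hookFinalizer)
	fmt.Println(len(finalizers)) // 0
}
```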

Comment on lines 886 to 888
if job.Spec.TTLSecondsAfterFinished == nil {
return nil
}
Member

What would be the downside(s) to always using a finalizer instead of only using one when this field is set?

Author

That is a good question. The logic would be simpler and more generic if all hooks had a finalizer, and cleanup would also be simpler.

Comment on lines 871 to 872
// processJobHookTask processes a hook task whose target object is a Job with ttlSecondsAfterFinished defined.
// This addresses the issue where a Job with ttlSecondsAfterFinished set to a low value gets deleted quickly and the hook phase gets stuck.
Member

Should the finalizer feature be limited to just Jobs? I understand that Argo Workflows can also exhibit the same behavior. I could imagine any resource having the same issue if some process deletes the resource before Argo CD has a chance to observe the final resource state.

Author

I thought about that, and yes, that is true.
I focused only on the immediate issue as a PoC. I am also in favour of a generic solution for the scenario where some external process deletes a resource during the hook phase.

@dejanzele dejanzele force-pushed the fix/job-ttl-stuck-hook branch 4 times, most recently from 376d9d0 to 57d66fd Compare December 9, 2024 16:08
}
}
if mutated {
task.targetObj.SetFinalizers(finalizers)
Author

Should I set it at all for the targetObj?

Member

At a glance, it's not clear to me... would you mind investigating/documenting how liveObj and targetObj on the syncTask struct are used?

@dejanzele
Author

@crenshaw-dev I have updated the PR to make the solution more generic.

This use case should also be covered by an integration test, but I guess that test would live in the argo-cd repository?

@crenshaw-dev
Member

@dejanzele yep! You can open a PR on the argo-cd repo temporarily replacing gitops-engine in go.mod with your fork and revision.
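The suggested go.mod swap can be sketched as below; the fork path and pseudo-version are placeholders to be replaced with your actual fork and revision, not values taken from this PR:

```
// In the argo-cd repo's go.mod (temporary, for testing only).
// Substitute your fork path and the pseudo-version that
// `go mod tidy` resolves for your revision.
replace github.com/argoproj/gitops-engine => github.com/<your-user>/gitops-engine v0.0.0-<timestamp>-<commit>
```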

// In that case, we need to get the latest version of the object and retry the update.
return retry.RetryOnConflict(retry.DefaultRetry, func() error {
updateErr := sc.updateResource(task)
if apierr.IsConflict(updateErr) {
Author

Conflicts happen quite often, and without retries the E2E tests were very flaky.

@dejanzele dejanzele force-pushed the fix/job-ttl-stuck-hook branch from f02987e to 4e93b3e Compare December 25, 2024 16:02
@dejanzele dejanzele requested a review from a team as a code owner December 25, 2024 16:02

codecov bot commented Dec 25, 2024

Codecov Report

Attention: Patch coverage is 56.66667% with 26 lines in your changes missing coverage. Please review.

Project coverage is 54.45%. Comparing base (8849c3f) to head (4e93b3e).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
pkg/sync/sync_context.go 63.46% 17 Missing and 2 partials ⚠️
pkg/sync/hook/hook.go 0.00% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #646      +/-   ##
==========================================
+ Coverage   54.26%   54.45%   +0.18%     
==========================================
  Files          64       64              
  Lines        6164     6330     +166     
==========================================
+ Hits         3345     3447     +102     
- Misses       2549     2607      +58     
- Partials      270      276       +6     


Development

Successfully merging this pull request may close these issues.

Stuck hook issue when a sync task contains a Job resource with a ttlSecondsAfterFinished field set
2 participants