OSAC-1383: force-cleanup orphaned CAPI Machine pre-terminate hooks during teardown #65
…ring teardown

The capi-provider-agent controller sets a pre-terminate hook annotation on CAPI Machines but gets killed during CP namespace teardown before removing it. This leaves the Machine stuck in Deleting with condition WaitingExternalHook, blocking the entire deletion cascade.

Add _force_cleanup_machine_preterminate_hooks() to the existing wait_for_cluster_deletion polling loop alongside the AgentCluster finalizer and Agent label workarounds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@omer-vishlitzky: This pull request references OSAC-1383 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omer-vishlitzky
No actionable comments were generated in the recent review.
Walkthrough

This PR extends the cluster deletion polling logic with an additional forced cleanup step targeting machine pre-terminate hooks. A new helper function discovers Machine resources in the CAPI namespace and annotates them to remove blocking hooks, complementing existing agent cluster and label cleanup operations.
Security Consideration

Risk Severity: Low | Impact: Cleanup/Remediation

This change introduces privileged operations (

No new security surface is exposed by this cleanup step.
# 3. Machine pre-terminate hooks: the CAPI provider sets a pre-terminate hook
# annotation on Machines, but is killed before removing it. The CAPI Machine
# controller waits forever for the annotation to be removed, blocking the entire
# deletion cascade (Machine → MachineSet → CAPI Cluster → HostedCluster).
please add the hypershift bug here
https://redhat-external.slack.com/archives/C01C8502FMM/p1779892666472019
tl:dr
Details:
We hit an infinite deadlock deleting a HostedCluster on the Agent platform. The HostedCluster has been stuck in deletion for hours.
Root cause: The cluster-api-provider-agent controller runs as a deployment inside the control plane namespace. During HostedCluster deletion, the delete() function in hostedcluster_controller.go:3334 (https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go#L3334) follows this sequence:
1. Delete the CAPI Cluster CR and wait for it to disappear
2. Once the CAPI Cluster is gone, remove finalizers from managed deployments, including the capi-provider deployment. This allows the capi-provider-agent pod to be garbage collected
3. Delete the HCP and wait
4. Delete the control plane namespace and wait
The problem: the CAPI Cluster CR can disappear (step 1) while the AgentCluster CR still exists with its agentclustercapi-provider.agent-install.openshift.io/deprovision finalizer. The CAPI manager deletes the AgentCluster but doesn't wait for it to be fully gone before removing its own finalizer. Once step 2 removes the deployment finalizers, the capi-provider-agent pod gets killed. Now nobody is alive to remove the AgentCluster's finalizer. The namespace can't terminate (it has a resource with a finalizer), and delete() loops at step 4 forever: "Waiting for namespace deletion".
This is the exact same class of bug that was already fixed for Karpenter. karpenter.go:88-140 (https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/karpenter.go#L88-L140) has resolveKarpenterFinalizer() with the comment:
> "Without this fallback the HCP would be stuck in terminating with the karpenter finalizer blocking deletion indefinitely."
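For illustration only, here is a minimal Python sketch of what a fallback of this kind could look like from the test-helper side: when a resource is already being deleted and its owning controller is gone, build a JSON merge patch that drops one finalizer. The function name `finalizer_removal_patch` and the example finalizer `example.io/deprovision` are hypothetical; this is not the HyperShift code and not the existing workaround in this repository.

```python
import json


def finalizer_removal_patch(resource, finalizer):
    """Return a JSON merge patch that drops `finalizer`, or None if not applicable.

    Hypothetical helper: `resource` is a dict shaped like `kubectl get ... -o json`.
    """
    meta = resource.get("metadata", {})
    # Only touch resources that are already being deleted.
    if "deletionTimestamp" not in meta:
        return None
    current = meta.get("finalizers") or []
    if finalizer not in current:
        return None
    # A merge patch replaces the whole list, so send the remaining finalizers.
    remaining = [f for f in current if f != finalizer]
    return json.dumps({"metadata": {"finalizers": remaining}})


# Example with made-up data: a resource stuck in deletion with one finalizer.
stuck = {
    "metadata": {
        "name": "demo",
        "deletionTimestamp": "2024-01-01T00:00:00Z",
        "finalizers": ["example.io/deprovision"],
    }
}
patch = finalizer_removal_patch(stuck, "example.io/deprovision")
```

The resulting string could be passed to `kubectl patch <kind> <name> -n <namespace> --type=merge -p "$PATCH"`. Returning None for resources that are not being deleted keeps the fallback from stripping finalizers on healthy objects.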
/lgtm

/lgtm
Summary

- Add `_force_cleanup_machine_preterminate_hooks()` to the `wait_for_cluster_deletion` polling loop
- Remove `pre-terminate.delete.hook.machine.cluster.x-k8s.io/agentmachine` annotations from CAPI Machines in the control plane namespace

Problem
The capi-provider-agent controller sets a pre-terminate hook annotation on CAPI Machines, but gets killed during CP namespace teardown before removing it. The CAPI Machine controller waits forever for the annotation to be removed, blocking the entire deletion cascade: Machine → MachineSet → CAPI Cluster → HostedCluster → ClusterOrder.
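A rough sketch of the cleanup step, for readers who want to see the shape of it. This is not the code in this PR: the names `stuck_machine_hooks`, `removal_command` and `force_cleanup` are hypothetical, and it assumes `kubectl` is on the PATH with access to the control plane namespace. It relies on the standard `kubectl annotate <resource> <key>-` syntax, where a trailing dash removes the annotation.

```python
import json
import subprocess

HOOK_PREFIX = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/"


def stuck_machine_hooks(machine_list):
    """Map each Machine that is being deleted to its pre-terminate hook keys."""
    stuck = {}
    for item in machine_list.get("items", []):
        meta = item.get("metadata", {})
        if "deletionTimestamp" not in meta:
            continue  # not being deleted, leave its hooks alone
        annotations = meta.get("annotations") or {}
        hooks = sorted(k for k in annotations if k.startswith(HOOK_PREFIX))
        if hooks:
            stuck[meta["name"]] = hooks
    return stuck


def removal_command(namespace, name, hooks):
    """Build the kubectl call; a trailing '-' on a key removes that annotation."""
    return ["kubectl", "annotate", "machines.cluster.x-k8s.io", name,
            "-n", namespace, *[f"{hook}-" for hook in hooks]]


def force_cleanup(namespace, run=subprocess.run):
    """List Machines, strip orphaned hooks, return the names that were touched."""
    listed = run(["kubectl", "get", "machines.cluster.x-k8s.io",
                  "-n", namespace, "-o", "json"],
                 capture_output=True, text=True, check=True)
    stuck = stuck_machine_hooks(json.loads(listed.stdout))
    for name, hooks in stuck.items():
        # Best effort: the Machine may disappear between the list and the annotate.
        run(removal_command(namespace, name, hooks),
            capture_output=True, text=True, check=False)
    return sorted(stuck)
```

Passing `run` as a parameter lets the function be exercised with a fake runner in unit tests, and filtering on `deletionTimestamp` keeps the step from removing hooks on Machines that are not being torn down.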
Evidence from failing CI run:

- `nodepool-order-jkf5q-ci-worker-h8lhh` stuck with `PreTerminateDeleteHookSucceeded: False, WaitingExternalHook`
- `items: []`: controller gone, no one left to remove the hook
- `helpers.py:221`

Test plan
Jira: https://redhat.atlassian.net/browse/OSAC-1383
🤖 Generated with Claude Code