Skip to content

OSAC-1383: force-cleanup orphaned CAPI Machine pre-terminate hooks during teardown#65

Merged
openshift-merge-bot[bot] merged 1 commit into
osac-project:mainfrom
omer-vishlitzky:fix-preterminate-hook-cleanup
Jun 10, 2026
Merged

OSAC-1383: force-cleanup orphaned CAPI Machine pre-terminate hooks during teardown#65
openshift-merge-bot[bot] merged 1 commit into
osac-project:mainfrom
omer-vishlitzky:fix-preterminate-hook-cleanup

Conversation

@omer-vishlitzky

@omer-vishlitzky omer-vishlitzky commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add _force_cleanup_machine_preterminate_hooks() to the wait_for_cluster_deletion polling loop
  • Removes orphaned pre-terminate.delete.hook.machine.cluster.x-k8s.io/agentmachine annotations from CAPI Machines in the control plane namespace
  • Third workaround alongside the existing AgentCluster finalizer and Agent label cleanups

Problem

The capi-provider-agent controller sets a pre-terminate hook annotation on CAPI Machines, but gets killed during CP namespace teardown before removing it. The CAPI Machine controller waits forever for the annotation to be removed, blocking the entire deletion cascade: Machine → MachineSet → CAPI Cluster → HostedCluster → ClusterOrder.

Evidence from failing CI run:

  • Machine nodepool-order-jkf5q-ci-worker-h8lhh stuck with PreTerminateDeleteHookSucceeded: False, WaitingExternalHook
  • AgentMachines: items: [] — controller gone, no one left to remove the hook
  • Same class of issue as the existing HACK at helpers.py:221

Test plan

  • CaaS e2e tests pass with this fix (deletion teardown no longer times out on orphaned hooks)
  • VMaaS e2e tests unaffected (no CAPI Machines in VMaaS path)

Jira: https://redhat.atlassian.net/browse/OSAC-1383

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved cluster deletion reliability by enhancing resource cleanup during the machine teardown phase to prevent deletion failures caused by unresolved system cleanup conditions.

…ring teardown

The capi-provider-agent controller sets a pre-terminate hook annotation
on CAPI Machines but gets killed during CP namespace teardown before
removing it. This leaves the Machine stuck in Deleting with condition
WaitingExternalHook, blocking the entire deletion cascade.

Add _force_cleanup_machine_preterminate_hooks() to the existing
wait_for_cluster_deletion polling loop alongside the AgentCluster
finalizer and Agent label workarounds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci-robot

openshift-ci-robot commented Jun 10, 2026

Copy link
Copy Markdown

@omer-vishlitzky: This pull request references OSAC-1383 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add _force_cleanup_machine_preterminate_hooks() to the wait_for_cluster_deletion polling loop
  • Removes orphaned pre-terminate.delete.hook.machine.cluster.x-k8s.io/agentmachine annotations from CAPI Machines in the control plane namespace
  • Third workaround alongside the existing AgentCluster finalizer and Agent label cleanups

Problem

The capi-provider-agent controller sets a pre-terminate hook annotation on CAPI Machines, but gets killed during CP namespace teardown before removing it. The CAPI Machine controller waits forever for the annotation to be removed, blocking the entire deletion cascade: Machine → MachineSet → CAPI Cluster → HostedCluster → ClusterOrder.

Evidence from failing CI run:

  • Machine nodepool-order-jkf5q-ci-worker-h8lhh stuck with PreTerminateDeleteHookSucceeded: False, WaitingExternalHook
  • AgentMachines: items: [] — controller gone, no one left to remove the hook
  • Same class of issue as the existing HACK at helpers.py:221

Test plan

  • CaaS e2e tests pass with this fix (deletion teardown no longer times out on orphaned hooks)
  • VMaaS e2e tests unaffected (no CAPI Machines in VMaaS path)

Jira: https://redhat.atlassian.net/browse/OSAC-1383

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from adriengentil and danmanor June 10, 2026 09:14
@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omer-vishlitzky

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: fbc724ef-9d20-4540-a638-bcc32e1a8730

📥 Commits

Reviewing files that changed from the base of the PR and between 68248de and e295907.

📒 Files selected for processing (1)
  • tests/core/helpers.py

Walkthrough

This PR extends the cluster deletion polling logic with an additional forced cleanup step targeting machine pre-terminate hooks. A new helper function discovers Machine resources in the CAPI namespace and annotates them to remove blocking hooks, complementing existing agent cluster and label cleanup operations.

Changes

Machine Pre-terminate Hook Cleanup

Layer / File(s) Summary
Machine pre-terminate hook cleanup helper
tests/core/helpers.py
New _force_cleanup_machine_preterminate_hooks function discovers Machine resources in the CAPI namespace derived from k8s.namespace and the hosted cluster name, then annotates each machine with the pre-terminate.delete.hook.machine.cluster.x-k8s.io/agentmachine hook key using system:admin impersonation via run_unchecked calls.
Integration into cluster deletion polling
tests/core/helpers.py
The _check_deleted routine in wait_for_cluster_deletion now invokes the machine pre-terminate hook cleanup on each poll iteration, extending the prior forced cleanup sequence that covered agent cluster finalizers and agent labels.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • osac-project/osac-test-infra#56: Both PRs extend wait_for_cluster_deletion with additional forced cleanup helpers for teardown-blocking Kubernetes resources on agent-related objects.
  • osac-project/osac-test-infra#50: Both PRs add forced cleanup steps to wait_for_cluster_deletion polling to remove deletion-blocking finalizers and hooks.

Suggested labels

lgtm

Suggested reviewers

  • jhernand
  • trewest
  • akshaynadkarni

Poem

A hook once blocked the way,
Machines couldn't say their last goodbye,
Now cleanup clears the path each day,
With admin powers passing by—
Let deletion dance! ✨


Security Consideration

Risk Severity: Low | Impact: Cleanup/Remediation

This change introduces privileged operations (system:admin impersonation) for test infrastructure cleanup. The security posture is low-risk because:

  • The impersonation is confined to test-only helper code in tests/core/helpers.py
  • The operation is remedial (removing hooks to enable deletion) rather than creating privilege escalation
  • The impersonation pattern mirrors existing cleanup patterns already present in the codebase (agent cluster finalizer and label cleanup)
  • No new credentials or secrets are introduced; existing k8s._base() patterns are reused

No new security surface is exposed by this cleanup step.

🚥 Pre-merge checks | ✅ 9 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Ai-Attribution ⚠️ Warning Commit uses Co-Authored-By for Claude Opus 4.6 AI tool without required Red Hat Assisted-by/Generated-by attribution trailers. Replace Co-Authored-By trailer with proper Red Hat attribution (Assisted-by or Generated-by) per Red Hat policy for AI-assisted code contributions.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding a force-cleanup mechanism for orphaned CAPI Machine pre-terminate hooks during cluster teardown, which directly matches the new helper function and its integration into the deletion polling loop.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No-Hardcoded-Secrets ✅ Passed No hardcoded secrets detected. Code contains only infrastructure identifiers, kubectl flags, and Kubernetes annotation keys; "system:admin" is legitimate impersonation syntax.
No-Weak-Crypto ✅ Passed PR adds infrastructure cleanup function with zero cryptographic operations; no weak crypto patterns (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto) or insecure comparisons detected.
No-Injection-Vectors ✅ Passed No injection vectors detected: safe subprocess.run() with argument lists, machine names from structured API output, no shell=True or dangerous patterns (SQL/eval/pickle/yaml.load/os.system).
Container-Privileges ✅ Passed PR modifies only Python test helpers, not container manifests. No privileged settings, hostPID/Network/IPC, SYS_ADMIN capabilities, or allowPrivilegeEscalation found.
No-Sensitive-Data-In-Logs ✅ Passed New function _force_cleanup_machine_preterminate_hooks contains no logging/print statements that could expose sensitive data; only retrieves Kubernetes resource names and namespace metadata.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread tests/core/helpers.py
# 3. Machine pre-terminate hooks: the CAPI provider sets a pre-terminate hook
# annotation on Machines, but is killed before removing it. The CAPI Machine
# controller waits forever for the annotation to be removed, blocking the entire
# deletion cascade (Machine → MachineSet → CAPI Cluster → HostedCluster).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add the hjypershift bug here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://redhat-external.slack.com/archives/C01C8502FMM/p1779892666472019

tl:dr

Details:
We hit an infinite deadlock deleting a HostedCluster on the Agent platform. The HostedCluster has been stuck in deletion for hours.

Root cause: The cluster-api-provider-agent controller runs as a deployment inside the control plane namespace. During HostedCluster deletion, the delete() function in
hostedcluster_controller.go:3334 (https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go#L3334) follows this sequence:


Delete the CAPI Cluster CR and wait for it to disappear
Once the CAPI Cluster is gone, remove finalizers from managed deployments including the capi-provider deployment — this allows the capi-provider-agent pod to be garbage collected
Delete the HCP and wait
Delete the control plane namespace and wait


The problem: the CAPI Cluster CR can disappear (step 1) while the AgentCluster CR still exists with its [agentclustercapi-provider.agent-install.openshift.io/deprovision](http://agentclustercapi-provider.agent-install.openshift.io/deprovision) finalizer. The CAPI
manager deletes the AgentCluster but doesn't wait for it to be fully gone before removing its own finalizer. Once step 2 removes the deployment finalizers, the capi-provider-agent pod gets
killed. Now nobody is alive to remove the AgentCluster's finalizer. The namespace can't terminate (it has a resource with a finalizer), and delete() loops at step 4 forever: "Waiting for
namespace deletion".

This is the exact same class of bug that was already fixed for Karpenter. karpenter.go:88-140
(https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/karpenter.go#L88-L140) has resolveKarpenterFinalizer() with the comment:

▎ "Without this fallback the HCP would be stuck in terminating with the karpenter finalizer blocking deletion indefinitely."

@rgolangh

Copy link
Copy Markdown
Contributor

/lgtm

@rgolangh

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-merge-bot openshift-merge-bot Bot merged commit 2084a58 into osac-project:main Jun 10, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants