OSAC-1383: force-cleanup orphaned CAPI Machine pre-terminate hooks during teardown by omer-vishlitzky · Pull Request #65 · osac-project/osac-test-infra

omer-vishlitzky · 2026-06-10T09:14:08Z

Summary

Add _force_cleanup_machine_preterminate_hooks() to the wait_for_cluster_deletion polling loop
Removes orphaned pre-terminate.delete.hook.machine.cluster.x-k8s.io/agentmachine annotations from CAPI Machines in the control plane namespace
Third workaround alongside the existing AgentCluster finalizer and Agent label cleanups

Problem

The capi-provider-agent controller sets a pre-terminate hook annotation on CAPI Machines, but gets killed during CP namespace teardown before removing it. The CAPI Machine controller waits forever for the annotation to be removed, blocking the entire deletion cascade: Machine → MachineSet → CAPI Cluster → HostedCluster → ClusterOrder.

Evidence from failing CI run:

Machine nodepool-order-jkf5q-ci-worker-h8lhh stuck with PreTerminateDeleteHookSucceeded: False, WaitingExternalHook
AgentMachines: items: [] — controller gone, no one left to remove the hook
Same class of issue as the existing HACK at helpers.py:221

Test plan

CaaS e2e tests pass with this fix (deletion teardown no longer times out on orphaned hooks)
VMaaS e2e tests unaffected (no CAPI Machines in VMaaS path)

Jira: https://redhat.atlassian.net/browse/OSAC-1383

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved cluster deletion reliability by enhancing resource cleanup during the machine teardown phase to prevent deletion failures caused by unresolved system cleanup conditions.

…ring teardown The capi-provider-agent controller sets a pre-terminate hook annotation on CAPI Machines but gets killed during CP namespace teardown before removing it. This leaves the Machine stuck in Deleting with condition WaitingExternalHook, blocking the entire deletion cascade. Add _force_cleanup_machine_preterminate_hooks() to the existing wait_for_cluster_deletion polling loop alongside the AgentCluster finalizer and Agent label workarounds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openshift-ci-robot · 2026-06-10T09:14:12Z

@omer-vishlitzky: This pull request references OSAC-1383 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add _force_cleanup_machine_preterminate_hooks() to the wait_for_cluster_deletion polling loop

Removes orphaned pre-terminate.delete.hook.machine.cluster.x-k8s.io/agentmachine annotations from CAPI Machines in the control plane namespace

Third workaround alongside the existing AgentCluster finalizer and Agent label cleanups

Problem

The capi-provider-agent controller sets a pre-terminate hook annotation on CAPI Machines, but gets killed during CP namespace teardown before removing it. The CAPI Machine controller waits forever for the annotation to be removed, blocking the entire deletion cascade: Machine → MachineSet → CAPI Cluster → HostedCluster → ClusterOrder.

Evidence from failing CI run:

Machine nodepool-order-jkf5q-ci-worker-h8lhh stuck with PreTerminateDeleteHookSucceeded: False, WaitingExternalHook

AgentMachines: items: [] — controller gone, no one left to remove the hook

Same class of issue as the existing HACK at helpers.py:221

Test plan

CaaS e2e tests pass with this fix (deletion teardown no longer times out on orphaned hooks)

VMaaS e2e tests unaffected (no CAPI Machines in VMaaS path)

Jira: https://redhat.atlassian.net/browse/OSAC-1383

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-06-10T09:14:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omer-vishlitzky

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [omer-vishlitzky]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-10T09:14:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: fbc724ef-9d20-4540-a638-bcc32e1a8730

📥 Commits

Reviewing files that changed from the base of the PR and between 68248de and e295907.

📒 Files selected for processing (1)

tests/core/helpers.py

Walkthrough

This PR extends the cluster deletion polling logic with an additional forced cleanup step targeting machine pre-terminate hooks. A new helper function discovers Machine resources in the CAPI namespace and annotates them to remove blocking hooks, complementing existing agent cluster and label cleanup operations.

Changes

Machine Pre-terminate Hook Cleanup

Layer / File(s)	Summary
Machine pre-terminate hook cleanup helper `tests/core/helpers.py`	New `_force_cleanup_machine_preterminate_hooks` function discovers Machine resources in the CAPI namespace derived from `k8s.namespace` and the hosted cluster name, then annotates each machine with the `pre-terminate.delete.hook.machine.cluster.x-k8s.io/agentmachine` hook key using `system:admin` impersonation via `run_unchecked` calls.
Integration into cluster deletion polling `tests/core/helpers.py`	The `_check_deleted` routine in `wait_for_cluster_deletion` now invokes the machine pre-terminate hook cleanup on each poll iteration, extending the prior forced cleanup sequence that covered agent cluster finalizers and agent labels.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

osac-project/osac-test-infra#56: Both PRs extend wait_for_cluster_deletion with additional forced cleanup helpers for teardown-blocking Kubernetes resources on agent-related objects.
osac-project/osac-test-infra#50: Both PRs add forced cleanup steps to wait_for_cluster_deletion polling to remove deletion-blocking finalizers and hooks.

Suggested labels

lgtm

Suggested reviewers

jhernand
trewest
akshaynadkarni

Poem

A hook once blocked the way,
Machines couldn't say their last goodbye,
Now cleanup clears the path each day,
With admin powers passing by—
Let deletion dance! ✨

Security Consideration

Risk Severity: Low | Impact: Cleanup/Remediation

This change introduces privileged operations (system:admin impersonation) for test infrastructure cleanup. The security posture is low-risk because:

The impersonation is confined to test-only helper code in tests/core/helpers.py
The operation is remedial (removing hooks to enable deletion) rather than creating privilege escalation
The impersonation pattern mirrors existing cleanup patterns already present in the codebase (agent cluster finalizer and label cleanup)
No new credentials or secrets are introduced; existing k8s._base() patterns are reused

No new security surface is exposed by this cleanup step.

🚥 Pre-merge checks | ✅ 9 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Ai-Attribution	⚠️ Warning	Commit uses Co-Authored-By for Claude Opus 4.6 AI tool without required Red Hat Assisted-by/Generated-by attribution trailers.	Replace Co-Authored-By trailer with proper Red Hat attribution (Assisted-by or Generated-by) per Red Hat policy for AI-assisted code contributions.

✅ Passed checks (9 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding a force-cleanup mechanism for orphaned CAPI Machine pre-terminate hooks during cluster teardown, which directly matches the new helper function and its integration into the deletion polling loop.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
No-Hardcoded-Secrets	✅ Passed	No hardcoded secrets detected. Code contains only infrastructure identifiers, kubectl flags, and Kubernetes annotation keys; "system:admin" is legitimate impersonation syntax.
No-Weak-Crypto	✅ Passed	PR adds infrastructure cleanup function with zero cryptographic operations; no weak crypto patterns (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto) or insecure comparisons detected.
No-Injection-Vectors	✅ Passed	No injection vectors detected: safe subprocess.run() with argument lists, machine names from structured API output, no shell=True or dangerous patterns (SQL/eval/pickle/yaml.load/os.system).
Container-Privileges	✅ Passed	PR modifies only Python test helpers, not container manifests. No privileged settings, hostPID/Network/IPC, SYS_ADMIN capabilities, or allowPrivilegeEscalation found.
No-Sensitive-Data-In-Logs	✅ Passed	New function _force_cleanup_machine_preterminate_hooks contains no logging/print statements that could expose sensitive data; only retrieves Kubernetes resource names and namespace metadata.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

rgolangh · 2026-06-10T10:17:12Z

+    # 3. Machine pre-terminate hooks: the CAPI provider sets a pre-terminate hook
+    #    annotation on Machines, but is killed before removing it. The CAPI Machine
+    #    controller waits forever for the annotation to be removed, blocking the entire
+    #    deletion cascade (Machine → MachineSet → CAPI Cluster → HostedCluster).


please add the hjypershift bug here

https://redhat-external.slack.com/archives/C01C8502FMM/p1779892666472019

tl:dr

Details: We hit an infinite deadlock deleting a HostedCluster on the Agent platform. The HostedCluster has been stuck in deletion for hours. Root cause: The cluster-api-provider-agent controller runs as a deployment inside the control plane namespace. During HostedCluster deletion, the delete() function in hostedcluster_controller.go:3334 (https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go#L3334) follows this sequence: Delete the CAPI Cluster CR and wait for it to disappear Once the CAPI Cluster is gone, remove finalizers from managed deployments including the capi-provider deployment — this allows the capi-provider-agent pod to be garbage collected Delete the HCP and wait Delete the control plane namespace and wait The problem: the CAPI Cluster CR can disappear (step 1) while the AgentCluster CR still exists with its [agentclustercapi-provider.agent-install.openshift.io/deprovision](http://agentclustercapi-provider.agent-install.openshift.io/deprovision) finalizer. The CAPI manager deletes the AgentCluster but doesn't wait for it to be fully gone before removing its own finalizer. Once step 2 removes the deployment finalizers, the capi-provider-agent pod gets killed. Now nobody is alive to remove the AgentCluster's finalizer. The namespace can't terminate (it has a resource with a finalizer), and delete() loops at step 4 forever: "Waiting for namespace deletion". This is the exact same class of bug that was already fixed for Karpenter. karpenter.go:88-140 (https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/karpenter.go#L88-L140) has resolveKarpenterFinalizer() with the comment: ▎ "Without this fallback the HCP would be stuck in terminating with the karpenter finalizer blocking deletion indefinitely."

rgolangh · 2026-06-10T10:18:05Z

/lgtm

rgolangh · 2026-06-10T10:18:45Z

/lgtm

openshift-ci-robot added the jira/valid-reference label Jun 10, 2026

openshift-ci Bot requested review from adriengentil and danmanor June 10, 2026 09:14

openshift-ci Bot added the approved label Jun 10, 2026

coderabbitai Bot approved these changes Jun 10, 2026

View reviewed changes

rgolangh reviewed Jun 10, 2026

View reviewed changes

openshift-ci Bot assigned rgolangh Jun 10, 2026

openshift-ci Bot added the lgtm label Jun 10, 2026

openshift-merge-bot Bot merged commit 2084a58 into osac-project:main Jun 10, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OSAC-1383: force-cleanup orphaned CAPI Machine pre-terminate hooks during teardown#65

OSAC-1383: force-cleanup orphaned CAPI Machine pre-terminate hooks during teardown#65
openshift-merge-bot[bot] merged 1 commit into
osac-project:mainfrom
omer-vishlitzky:fix-preterminate-hook-cleanup

omer-vishlitzky commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-ci-robot commented Jun 10, 2026 •

edited by openshift-ci Bot

Loading

Summary

Problem

Test plan

Uh oh!

openshift-ci Bot commented Jun 10, 2026

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

❌ Failed checks (2 warnings)

Uh oh!

rgolangh Jun 10, 2026

Uh oh!

omer-vishlitzky Jun 10, 2026

Uh oh!

rgolangh commented Jun 10, 2026

Uh oh!

rgolangh commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

omer-vishlitzky commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Test plan

Summary by CodeRabbit

Uh oh!

openshift-ci-robot commented Jun 10, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Test plan

Uh oh!

openshift-ci Bot commented Jun 10, 2026

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Security Consideration

❌ Failed checks (2 warnings)

Uh oh!

rgolangh Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

omer-vishlitzky Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

rgolangh commented Jun 10, 2026

Uh oh!

rgolangh commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

omer-vishlitzky commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented Jun 10, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading