Merged
23 changes: 23 additions & 0 deletions tests/core/helpers.py
@@ -229,9 +229,15 @@ def wait_for_cluster_deletion(*, k8s: K8sClient, name: str) -> None:
    #    clusterdeployment-namespace from agents after HostedCluster deletion. The delete
    #    playbook's detach_and_unlabel skips agents that still have this label set, leaving
    #    the clusterorder label stuck and blocking agent reuse for subsequent tests.
    #
    # 3. Machine pre-terminate hooks: the CAPI provider sets a pre-terminate hook
    #    annotation on Machines, but is killed before removing it. The CAPI Machine
    #    controller waits forever for the annotation to be removed, blocking the entire
    #    deletion cascade (Machine → MachineSet → CAPI Cluster → HostedCluster).

Contributor

Please add the HyperShift bug here.

Contributor Author

https://redhat-external.slack.com/archives/C01C8502FMM/p1779892666472019

tl;dr

Details:
We hit an infinite deadlock deleting a HostedCluster on the Agent platform. The HostedCluster has been stuck in deletion for hours.

Root cause: The cluster-api-provider-agent controller runs as a deployment inside the control plane namespace. During HostedCluster deletion, the delete() function in
hostedcluster_controller.go:3334 (https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go#L3334) follows this sequence:

1. Delete the CAPI Cluster CR and wait for it to disappear
2. Once the CAPI Cluster is gone, remove finalizers from managed deployments including the capi-provider deployment. This allows the capi-provider-agent pod to be garbage collected
3. Delete the HCP and wait
4. Delete the control plane namespace and wait

The problem: the CAPI Cluster CR can disappear (step 1) while the AgentCluster CR still exists with its agentclustercapi-provider.agent-install.openshift.io/deprovision finalizer. The CAPI
manager deletes the AgentCluster but doesn't wait for it to be fully gone before removing its own finalizer. Once step 2 removes the deployment finalizers, the capi-provider-agent pod gets
killed. Now nobody is alive to remove the AgentCluster's finalizer. The namespace can't terminate (it has a resource with a finalizer), and delete() loops at step 4 forever: "Waiting for
namespace deletion".

This is the exact same class of bug that was already fixed for Karpenter. karpenter.go:88-140
(https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/karpenter.go#L88-L140) has resolveKarpenterFinalizer() with the comment:

▎ "Without this fallback the HCP would be stuck in terminating with the karpenter finalizer blocking deletion indefinitely."
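
The ordering problem described above can be sketched as a toy model. This is only an illustration under stated assumptions: `Namespace`, `simulate_delete`, and the `force_cleanup` flag are hypothetical names invented here, not HyperShift code, and the finalizer string is copied as written in this comment. The model only captures the rule that a namespace cannot finish terminating while any resource in it still holds a finalizer.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Finalizer name as quoted in the comment above (not verified against the provider source).
DEPROVISION = "agentclustercapi-provider.agent-install.openshift.io/deprovision"


@dataclass
class Namespace:
    """Hypothetical toy model: resource name -> finalizers still attached."""

    resources: dict[str, set[str]] = field(default_factory=dict)

    def can_terminate(self) -> bool:
        # A namespace finishes terminating only when no resource holds a finalizer.
        return all(not finalizers for finalizers in self.resources.values())


def simulate_delete(*, force_cleanup: bool) -> bool:
    """Walk the four steps and report whether the namespace can terminate."""
    ns = Namespace({"agentcluster": {DEPROVISION}})

    # Step 1: the CAPI Cluster CR disappears while the AgentCluster
    # still carries its deprovision finalizer.
    # Step 2: deployment finalizers are removed, so the provider pod is collected.
    provider_alive = False

    # Only a live provider would remove the AgentCluster finalizer.
    if provider_alive:
        ns.resources["agentcluster"].discard(DEPROVISION)

    # Test-side workaround: strip the finalizer ourselves.
    if force_cleanup:
        ns.resources["agentcluster"].clear()

    # Steps 3-4: the namespace deletion wait succeeds only if nothing blocks it.
    return ns.can_terminate()
```

With `force_cleanup=False` the model stays blocked, which corresponds to the "Waiting for namespace deletion" loop; with `force_cleanup=True` it can terminate, which is the role the `_force_cleanup_*` helpers play in the test.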

    def _check_deleted() -> bool:
        _force_cleanup_agentcluster_finalizers(k8s=k8s, name=name)
        _force_cleanup_agent_labels(k8s=k8s, name=name)
        _force_cleanup_machine_preterminate_hooks(k8s=k8s, name=name)
        return not k8s.is_present(resource="clusterorder", name=name)

    poll_until(
@@ -290,6 +296,23 @@ def _force_cleanup_agent_labels(*, k8s: K8sClient, name: str) -> None:
    )


def _force_cleanup_machine_preterminate_hooks(*, k8s: K8sClient, name: str) -> None:
    """Strip the agentmachine pre-terminate hook from Machines so deletion can proceed."""
    cp_ns = f"{k8s.namespace}-{name}-{name}"
    hook = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/agentmachine"
    base_args = [*k8s._base(), "--as", "system:admin"]
    output, rc = run_unchecked(
        *base_args, "get", "machines.cluster.x-k8s.io",
        "-n", cp_ns, "-o", "jsonpath={.items[*].metadata.name}",
    )
    # Nothing to do if the lookup failed or no Machines remain.
    if rc != 0 or not output.strip():
        return
    for machine_name in output.strip().split():
        # A trailing "-" on the key tells `annotate` to remove the annotation.
        run_unchecked(
            *base_args, "annotate", f"machines.cluster.x-k8s.io/{machine_name}",
            "-n", cp_ns, f"{hook}-",
        )
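
For reviewers who want to check the argument construction without a cluster, the two pure pieces of the helper above (splitting the jsonpath output and building the annotation-removal command) could be factored out roughly as follows. This is a sketch, not part of the PR: `parse_machine_names` and `build_unannotate_args` are hypothetical names, and `base_args` stands in for whatever `k8s._base()` returns.

```python
from __future__ import annotations

# Hook annotation key, copied from the helper in this diff.
HOOK = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/agentmachine"


def parse_machine_names(jsonpath_output: str) -> list[str]:
    """Split the space-separated names produced by the jsonpath query."""
    # str.split() with no arguments drops leading/trailing whitespace and
    # returns an empty list for empty or whitespace-only input.
    return jsonpath_output.split()


def build_unannotate_args(
    *, base_args: list[str], namespace: str, machine_name: str, hook: str = HOOK
) -> list[str]:
    """Build the argv that removes `hook` from one Machine."""
    return [
        *base_args,
        "annotate",
        f"machines.cluster.x-k8s.io/{machine_name}",
        "-n",
        namespace,
        f"{hook}-",  # trailing "-" means "remove this annotation"
    ]
```

Keeping these pieces free of subprocess calls makes them easy to unit test; the real helper would still pass the resulting argv to `run_unchecked`.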


def wait_for_cluster_grpc_removal(*, grpc: GRPCClient, uuid: str) -> None:
    poll_until(
        fn=lambda: uuid not in grpc.list_cluster_ids(),