Conversation

davidepasquero

Problem
If a node encounters a catastrophic event, such as the kubelet going down unexpectedly or the entire node crashing, the controller does not forcefully remove agent pods stuck in the Terminating state, and no new agents are deployed. As a result, VMs that are reactivated on another node can no longer obtain their original IP addresses from the agent. If a downstream cluster consists of the affected VMs, this is catastrophic.

Solution

Below are the key changes with references to the affected files and line numbers.

A helper to force-delete pods immediately was added and is used wherever stale pods are removed (a hedged sketch follows this list):

- Helper definition in pkg/controller/ippool/controller.go, lines 483‑485
- Called when redeploying an agent pod if a previous instance is stuck in Terminating (lines 308‑312)
- Used when purging outdated pods during monitoring (lines 444‑456)
- Applied during cleanup of an IPPool (lines 488‑495)
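
For context, here is a minimal sketch of what such a force-delete helper could look like using client-go. The name `forceDeletePod`, its signature, and the use of a plain clientset are illustrative assumptions, not the actual code in pkg/controller/ippool/controller.go:

```go
package ippool

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod removes a pod immediately: a zero grace period tells the API
// server not to wait for graceful termination, which is what unsticks pods
// left in Terminating after a node failure.
func forceDeletePod(ctx context.Context, kube kubernetes.Interface, namespace, name string) error {
	grace := int64(0)
	policy := metav1.DeletePropagationBackground
	return kube.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
		PropagationPolicy:  &policy,
	})
}
```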

The automatic cleanup routine now patches away pod finalizers before force‑deleting (sketched below):

- The function CleanupTerminatingPods implements this logic (lines 15‑58)
- It is executed on startup by the controller and webhook binaries (controller run.go line 51, webhook run.go line 81)
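
A hedged sketch of such a startup cleanup routine, assuming a client-go clientset; the label selector and the decision to only touch pods that are already terminating are assumptions, not the PR's exact logic:

```go
package cleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// CleanupTerminatingPods finds agent pods stuck in Terminating, clears their
// finalizers, and force-deletes them so new agents can be scheduled.
func CleanupTerminatingPods(ctx context.Context, kube kubernetes.Interface, namespace string) error {
	pods, err := kube.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app=vm-dhcp-agent", // assumed label for agent pods
	})
	if err != nil {
		return err
	}
	grace := int64(0)
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp == nil {
			continue // only touch pods that are already terminating
		}
		if len(pod.Finalizers) > 0 {
			// Clear finalizers so the API server can actually remove the object.
			patch := []byte(`{"metadata":{"finalizers":[]}}`)
			if _, err := kube.CoreV1().Pods(namespace).Patch(ctx, pod.Name,
				types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
				return err
			}
		}
		if err := kube.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &grace, // force-delete immediately
		}); err != nil {
			return err
		}
	}
	return nil
}
```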

DHCP servers can now be stopped explicitly (see the sketch below):

- Public method Stop added at pkg/dhcp/dhcp.go, lines 321‑334
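
A rough sketch of what an explicit Stop could look like; the DHCPAllocator type, its per-interface server map, and the underlying Close call are structural assumptions rather than the PR's actual implementation:

```go
package dhcp

import (
	"fmt"
	"sync"
)

// server abstracts whatever concrete DHCP server implementation is running
// on a given network interface (anything exposing a Close method).
type server interface {
	Close() error
}

// DHCPAllocator keeps one running DHCP server per NIC.
type DHCPAllocator struct {
	mu      sync.Mutex
	servers map[string]server // keyed by NIC name
}

// Stop shuts down the DHCP server bound to the given interface, if any.
func (a *DHCPAllocator) Stop(nic string) error {
	a.mu.Lock()
	defer a.mu.Unlock()

	srv, ok := a.servers[nic]
	if !ok {
		return fmt.Errorf("no DHCP server running on interface %q", nic)
	}
	if err := srv.Close(); err != nil {
		return err
	}
	delete(a.servers, nic)
	return nil
}
```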

The agent controller uses the new Stop method to terminate DHCP when an IPPool is not ready (sketched below):

- Implementation in pkg/agent/ippool/ippool.go, lines 10‑17
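
A hypothetical fragment of the agent-side handler; all names here (Handler, dhcpStopper, the reduced IPPool stand-in) are illustrative and only show the shape of the logic:

```go
package ippool

// dhcpStopper is the only capability the handler needs from the DHCP layer;
// a DHCPAllocator like the one sketched above would satisfy it.
type dhcpStopper interface {
	Stop(nic string) error
}

// IPPool is a reduced stand-in for the project's IPPool custom resource,
// carrying only what this fragment needs.
type IPPool struct {
	Name  string
	Ready bool
}

// Handler reacts to IPPool events on the agent side.
type Handler struct {
	nic  string      // network interface this agent serves DHCP on
	dhcp dhcpStopper // used to shut the DHCP server down when the pool degrades
}

// OnChange stops the DHCP server for the tracked NIC whenever the IPPool is
// not ready, so the agent never serves leases from a degraded pool.
func (h *Handler) OnChange(key string, pool *IPPool) (*IPPool, error) {
	if pool == nil {
		return nil, nil
	}
	if !pool.Ready {
		if err := h.dhcp.Stop(h.nic); err != nil {
			return pool, err
		}
	}
	return pool, nil
}
```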

Network interface information is tracked by agent controllers so the correct DHCP instance can be stopped (see the sketch below):

- New nic field in the event handler struct and constructor (lines 26‑37 and 46‑62)
- The controller receives the interface at creation time (lines 118‑123)
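
Continuing the fragment above, a sketch of how the interface name might be threaded in when the handler is constructed; the constructor name and parameters are assumptions:

```go
// NewHandler wires the NIC name into the handler at creation time, so the
// handler later knows which DHCP instance to stop.
func NewHandler(nic string, dhcp dhcpStopper) *Handler {
	return &Handler{
		nic:  nic,
		dhcp: dhcp,
	}
}
```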

A node controller was introduced to remove agent pods from nodes that turn unready (sketched below):

- Controller logic resides in pkg/controller/node/controller.go (lines 18‑86 contain the constants, registration, and readiness check)
- Registration of this controller happens in pkg/controller/setup.go (lines 15‑19)
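
A hedged sketch of what such a node controller could look like; the handler name, the agent pod label, and the use of a raw clientset are assumptions for illustration:

```go
package node

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type nodeHandler struct {
	kube kubernetes.Interface
}

// nodeReady reports whether the node's Ready condition is True.
func nodeReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// OnChange deletes agent pods scheduled on a node that is no longer ready so
// replacements can be created on healthy nodes.
func (h *nodeHandler) OnChange(ctx context.Context, node *corev1.Node) error {
	if node == nil || nodeReady(node) {
		return nil
	}
	pods, err := h.kube.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		LabelSelector: "app=vm-dhcp-agent",              // assumed agent label
		FieldSelector: "spec.nodeName=" + node.Name,     // only pods on this node
	})
	if err != nil {
		return err
	}
	grace := int64(0)
	for _, pod := range pods.Items {
		if err := h.kube.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &grace, // force-delete so rescheduling isn't blocked
		}); err != nil {
			return err
		}
	}
	return nil
}
```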

Together, these changes ensure that stuck pods are force‑deleted, DHCP services can be stopped cleanly, and agents on failing nodes are removed automatically.

Related Issue:
harvester/harvester#8205

starbops and others added 11 commits June 19, 2025 17:14

Support image name with registry prefix containing port number or without tag (default to latest)

mergify bot commented Jun 20, 2025

This pull request is now in conflict. Could you fix it @davidepasquero? 🙏

Member

@starbops starbops left a comment


Thanks for the PR, @davidepasquero. However, we're more inclined to introduce the leader election mechanism, which is the norm in Kubernetes and is widely adopted by custom controllers across the Harvester ecosystem. We will transition the agents from pods to deployments and utilize the Kubernetes deployment controller to handle the heavy lifting of managing the pod lifecycle. This will address the situation of a node down incident, which prevents agent pods from being rescheduled and run on other healthy nodes until human intervention, i.e., manually deleting the agent pods stuck in the terminating state.

@davidepasquero
Author

davidepasquero commented Jul 4, 2025

Thank you very much for the detailed feedback and the clear direction suggested.

I have carefully read the comment and I fully agree that introducing a leader election mechanism and transitioning agents from Pod to Deployment is a much more robust and standard approach for the Kubernetes and Harvester ecosystem.

I fully understand how using the Kubernetes Deployment controller can natively and more effectively manage the lifecycle of the pods, solving the root problem of agents stuck on a node that is no longer reachable.

I propose a working implementation in which I transformed the agents into Deployments and implemented leader election. All the details are reported in the new PR #61, which I ask you to review. Thanks.

@starbops
Member

starbops commented Jul 9, 2025

@davidepasquero Thank you. I left some comments in #61, PTAL. Regarding the successor PR, I will close this one as obsolete. Feel free to reopen if you believe it's still relevant. Thank you.

@starbops starbops closed this Jul 9, 2025
@davidepasquero
Author

davidepasquero commented Jul 9, 2025 via email
