Increase robustness for the Managed DHCP add-on under disastrous events #54
Conversation
Support image name with registry prefix containing port number or without tag (default to latest) Signed-off-by: Pasquero Davide 2204 <[email protected]>
This pull request is now in conflict. Could you fix it @davidepasquero? 🙏
Thanks for the PR, @davidepasquero. However, we're more inclined to introduce the leader election mechanism, which is the norm in Kubernetes and is widely adopted by custom controllers across the Harvester ecosystem. We will transition the agents from pods to deployments and utilize the Kubernetes deployment controller to handle the heavy lifting of managing the pod lifecycle. This will address the situation of a node down incident, which prevents agent pods from being rescheduled and run on other healthy nodes until human intervention, i.e., manually deleting the agent pods stuck in the terminating state.
Thank you very much for the detailed feedback and the clear direction. I have read the comment carefully and fully agree that introducing a leader election mechanism and transitioning the agents from Pods to Deployments is a much more robust and standard approach for the Kubernetes and Harvester ecosystem. I fully understand how the Kubernetes Deployment controller can natively and more effectively manage the pod lifecycle, solving the root problem of agents stuck on a node that is no longer reachable. I propose a working implementation in which I transformed the agents into Deployments and implemented leader election. All the details are in the new PR #61, which I ask you to review. Thanks.
…-di-elezione-del-leader Add agent deployment support
…canismo-di-elezione-del-leader Fix agent deployment replicas
…canismo-di-elezione-del-leader Fix import usage for agent deployment
@davidepasquero Thank you. I left some comments in #61, PTAL. Since there is a successor PR, I will close this one as obsolete. Feel free to reopen if you believe it's still relevant. Thank you.
Hi,
Thanks for your patience. I'll need a bit more time to reorganize my
thoughts and provide a clearer explanation. I appreciate your understanding.
Thanks,
Davide
Problem
If a node encounters a catastrophic event, such as the kubelet going down unexpectedly or the entire node crashing, the controller does not forcefully remove agent pods stuck in the Terminating state, and no replacement agents are deployed. VMs that are reactivated on another node can therefore no longer obtain their original IP addresses from the agent. If a downstream cluster consists of the affected VMs, this is a catastrophe.
Solution
Below are the key changes with references to the affected files and line numbers.
A helper to instantly delete pods was added and used wherever stale pods are removed:
Helper definition in pkg/controller/ippool/controller.go lines 483‑485
Called when redeploying an agent pod if a previous instance is stuck in Terminating (lines 308‑312)
Used when purging outdated pods during monitoring (lines 444‑456)
Applied during cleanup of an IPPool (lines 488‑495)
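A minimal sketch of what such a force-delete helper can look like, written against plain client-go rather than the controller's own generated clients; the function name and signature are illustrative, not the actual code in pkg/controller/ippool/controller.go:

```go
package ippool

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod deletes a pod with a zero grace period so the API server
// removes it immediately instead of waiting for the kubelet on a dead node.
func forceDeletePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	grace := int64(0)
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
}
```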
Automatic cleanup routine now patches away pod finalizers before force‑deleting:
Function CleanupTerminatingPods implements this logic (lines 15‑58)
Executed on startup by controller and webhook binaries (controller run.go line 51, webhook run.go line 81)
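The cleanup routine can be sketched as follows, again with plain client-go; the package name, namespace, and label selector are placeholders, and the real function may differ in how it discovers agent pods:

```go
package cleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// CleanupTerminatingPods looks for agent pods stuck in Terminating, strips
// their finalizers, and force-deletes them so a fresh agent can be scheduled.
func CleanupTerminatingPods(ctx context.Context, client kubernetes.Interface, namespace, selector string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}

	grace := int64(0)
	for _, pod := range pods.Items {
		// A non-nil deletionTimestamp means the pod is already Terminating.
		if pod.DeletionTimestamp == nil {
			continue
		}
		// Patch away finalizers first so deletion is not blocked.
		patch := []byte(`{"metadata":{"finalizers":null}}`)
		if _, err := client.CoreV1().Pods(namespace).Patch(ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace}); err != nil {
			return err
		}
	}
	return nil
}
```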
DHCP servers can be stopped explicitly:
Public method Stop added at pkg/dhcp/dhcp.go lines 321‑334
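The shape of the new Stop method is roughly the following; the Allocator type, the server interface, and the per-interface map are illustrative stand-ins for whatever pkg/dhcp/dhcp.go actually keeps internally:

```go
package dhcp

import (
	"fmt"
	"sync"
)

// server abstracts the per-interface DHCP server object the package manages.
type server interface {
	Close() error
}

// Allocator is an illustrative stand-in for the DHCP allocator type.
type Allocator struct {
	mu      sync.Mutex
	servers map[string]server // keyed by network interface name
}

// Stop shuts down the DHCP server bound to the given interface, if any,
// and removes it from the bookkeeping map.
func (a *Allocator) Stop(nic string) error {
	a.mu.Lock()
	defer a.mu.Unlock()

	srv, ok := a.servers[nic]
	if !ok {
		return fmt.Errorf("no DHCP server running on interface %s", nic)
	}
	if err := srv.Close(); err != nil {
		return err
	}
	delete(a.servers, nic)
	return nil
}
```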
The agent controller uses the new Stop to terminate DHCP when an IPPool isn’t ready:
Implementation in pkg/agent/ippool/ippool.go lines 10‑17
Network interface information is tracked by agent controllers to stop the correct DHCP instance:
New nic field in the event handler struct and constructor (lines 26‑37 and 46‑62)
Controller receives the interface at creation time (lines 118‑123)
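Putting the two pieces together, a sketch of the agent-side handler: it remembers the interface it was created with and stops the matching DHCP instance when the IPPool is no longer ready. The type names and the readiness flag are assumptions for illustration, not the actual structs in pkg/agent/ippool/ippool.go:

```go
package ippool

import "context"

// dhcpStopper captures just the piece of the DHCP allocator the handler needs.
type dhcpStopper interface {
	Stop(nic string) error
}

// Handler is an illustrative event handler that tracks the network interface
// its DHCP server listens on so it can stop the right instance later.
type Handler struct {
	nic  string      // interface passed in at controller creation time
	dhcp dhcpStopper // DHCP allocator exposing the new Stop method
}

func NewHandler(nic string, dhcp dhcpStopper) *Handler {
	return &Handler{nic: nic, dhcp: dhcp}
}

// OnIPPoolChange stops the DHCP server on the handler's interface when the
// IPPool is no longer ready; `ready` stands in for the real status check.
func (h *Handler) OnIPPoolChange(_ context.Context, ready bool) error {
	if ready {
		return nil
	}
	return h.dhcp.Stop(h.nic)
}
```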
A node controller was introduced to remove agent pods from nodes that turn unready:
Controller logic resides in pkg/controller/node/controller.go (lines 18‑86 show the constants, registration, and readiness check)
Registration of this controller happens in pkg/controller/setup.go (lines 15‑19)
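The readiness check and pod removal in the node controller follow a standard pattern; the sketch below uses plain client-go and a placeholder agent label, so selectors and wiring differ from the actual pkg/controller/node/controller.go:

```go
package node

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeReady reports whether the node's Ready condition is True.
func nodeReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// OnNodeChange force-deletes agent pods scheduled on a node that is no longer
// Ready so they can be redeployed on a healthy node.
func OnNodeChange(ctx context.Context, client kubernetes.Interface, node *corev1.Node) error {
	if node == nil || nodeReady(node) {
		return nil
	}

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		LabelSelector: "app=vm-dhcp-agent", // placeholder agent label
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", node.Name),
	})
	if err != nil {
		return err
	}

	grace := int64(0)
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace}); err != nil {
			return err
		}
	}
	return nil
}
```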
These changes ensure stuck pods are force‑deleted, DHCP services can be stopped cleanly, and agents on failing nodes are removed automatically.
Related Issue:
harvester/harvester#8205