Conversation

davidepasquero

Problem
If a node encounters a catastrophic event, such as the kubelet going down unexpectedly or the entire node crashing, the controller does not forcefully remove agent pods stuck in the Terminating state, and no new agents are deployed. As a result, VMs that are reactivated on another node can no longer obtain their original IP addresses from the agent. If a downstream cluster consists of the affected VMs, this is catastrophic.

Solution

Below are the key changes with references to the affected files and line numbers.

A helper to force-delete pods immediately was added and is used wherever stale pods are removed (a hedged sketch follows this list):

- Helper definition in pkg/controller/ippool/controller.go, lines 483‑485
- Called when redeploying an agent pod if a previous instance is stuck in Terminating (lines 308‑312)
- Used when purging outdated pods during monitoring (lines 444‑456)
- Applied during cleanup of an IPPool (lines 488‑495)
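
For context, here is a minimal sketch of what such a force-delete helper could look like using client-go. The name `forceDeletePod`, its signature, and the use of a plain clientset are illustrative assumptions, not the actual code in pkg/controller/ippool/controller.go:

```go
package ippool

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod removes a pod immediately: a zero grace period tells the API
// server not to wait for graceful termination, which is what unsticks pods
// left in Terminating after a node failure.
func forceDeletePod(ctx context.Context, kube kubernetes.Interface, namespace, name string) error {
	grace := int64(0)
	policy := metav1.DeletePropagationBackground
	return kube.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
		PropagationPolicy:  &policy,
	})
}
```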

The automatic cleanup routine now patches away pod finalizers before force‑deleting (sketched below):

- The function CleanupTerminatingPods implements this logic (lines 15‑58)
- It is executed on startup by the controller and webhook binaries (controller run.go line 51, webhook run.go line 81)
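
A hedged sketch of such a startup cleanup routine, assuming a client-go clientset; the label selector and the decision to only touch pods that are already terminating are assumptions, not the PR's exact logic:

```go
package cleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// CleanupTerminatingPods finds agent pods stuck in Terminating, clears their
// finalizers, and force-deletes them so new agents can be scheduled.
func CleanupTerminatingPods(ctx context.Context, kube kubernetes.Interface, namespace string) error {
	pods, err := kube.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app=vm-dhcp-agent", // assumed label for agent pods
	})
	if err != nil {
		return err
	}
	grace := int64(0)
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp == nil {
			continue // only touch pods that are already terminating
		}
		if len(pod.Finalizers) > 0 {
			// Clear finalizers so the API server can actually remove the object.
			patch := []byte(`{"metadata":{"finalizers":[]}}`)
			if _, err := kube.CoreV1().Pods(namespace).Patch(ctx, pod.Name,
				types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
				return err
			}
		}
		if err := kube.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &grace, // force-delete immediately
		}); err != nil {
			return err
		}
	}
	return nil
}
```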

DHCP servers can now be stopped explicitly (see the sketch below):

- Public method Stop added at pkg/dhcp/dhcp.go, lines 321‑334
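
A rough sketch of what an explicit Stop could look like; the DHCPAllocator type, its per-interface server map, and the underlying Close call are structural assumptions rather than the PR's actual implementation:

```go
package dhcp

import (
	"fmt"
	"sync"
)

// server abstracts whatever concrete DHCP server implementation is running
// on a given network interface (anything exposing a Close method).
type server interface {
	Close() error
}

// DHCPAllocator keeps one running DHCP server per NIC.
type DHCPAllocator struct {
	mu      sync.Mutex
	servers map[string]server // keyed by NIC name
}

// Stop shuts down the DHCP server bound to the given interface, if any.
func (a *DHCPAllocator) Stop(nic string) error {
	a.mu.Lock()
	defer a.mu.Unlock()

	srv, ok := a.servers[nic]
	if !ok {
		return fmt.Errorf("no DHCP server running on interface %q", nic)
	}
	if err := srv.Close(); err != nil {
		return err
	}
	delete(a.servers, nic)
	return nil
}
```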

The agent controller uses the new Stop method to terminate DHCP when an IPPool is not ready (sketched below):

- Implementation in pkg/agent/ippool/ippool.go, lines 10‑17
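
A hypothetical fragment of the agent-side handler; all names here (Handler, dhcpStopper, the reduced IPPool stand-in) are illustrative and only show the shape of the logic:

```go
package ippool

// dhcpStopper is the only capability the handler needs from the DHCP layer;
// a DHCPAllocator like the one sketched above would satisfy it.
type dhcpStopper interface {
	Stop(nic string) error
}

// IPPool is a reduced stand-in for the project's IPPool custom resource,
// carrying only what this fragment needs.
type IPPool struct {
	Name  string
	Ready bool
}

// Handler reacts to IPPool events on the agent side.
type Handler struct {
	nic  string      // network interface this agent serves DHCP on
	dhcp dhcpStopper // used to shut the DHCP server down when the pool degrades
}

// OnChange stops the DHCP server for the tracked NIC whenever the IPPool is
// not ready, so the agent never serves leases from a degraded pool.
func (h *Handler) OnChange(key string, pool *IPPool) (*IPPool, error) {
	if pool == nil {
		return nil, nil
	}
	if !pool.Ready {
		if err := h.dhcp.Stop(h.nic); err != nil {
			return pool, err
		}
	}
	return pool, nil
}
```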

Network interface information is tracked by agent controllers so the correct DHCP instance can be stopped (see the sketch below):

- New nic field in the event handler struct and constructor (lines 26‑37 and 46‑62)
- The controller receives the interface at creation time (lines 118‑123)
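
Continuing the fragment above, a sketch of how the interface name might be threaded in when the handler is constructed; the constructor name and parameters are assumptions:

```go
// NewHandler wires the NIC name into the handler at creation time, so the
// handler later knows which DHCP instance to stop.
func NewHandler(nic string, dhcp dhcpStopper) *Handler {
	return &Handler{
		nic:  nic,
		dhcp: dhcp,
	}
}
```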

A node controller was introduced to remove agent pods from nodes that turn unready (sketched below):

- Controller logic resides in pkg/controller/node/controller.go (lines 18‑86 contain the constants, registration, and readiness check)
- Registration of this controller happens in pkg/controller/setup.go (lines 15‑19)
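
A hedged sketch of what such a node controller could look like; the handler name, the agent pod label, and the use of a raw clientset are assumptions for illustration:

```go
package node

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type nodeHandler struct {
	kube kubernetes.Interface
}

// nodeReady reports whether the node's Ready condition is True.
func nodeReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// OnChange deletes agent pods scheduled on a node that is no longer ready so
// replacements can be created on healthy nodes.
func (h *nodeHandler) OnChange(ctx context.Context, node *corev1.Node) error {
	if node == nil || nodeReady(node) {
		return nil
	}
	pods, err := h.kube.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		LabelSelector: "app=vm-dhcp-agent",              // assumed agent label
		FieldSelector: "spec.nodeName=" + node.Name,     // only pods on this node
	})
	if err != nil {
		return err
	}
	grace := int64(0)
	for _, pod := range pods.Items {
		if err := h.kube.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &grace, // force-delete so rescheduling isn't blocked
		}); err != nil {
			return err
		}
	}
	return nil
}
```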

Together, these changes ensure that stuck pods are force‑deleted, DHCP services can be stopped cleanly, and agents on failing nodes are removed automatically.

Related Issue:
harvester/harvester#8205

starbops and others added 11 commits June 19, 2025 17:14

Support image name with registry prefix containing port number or without tag (default to latest)

mergify bot commented Jun 20, 2025

This pull request is now in conflict. Could you fix it @davidepasquero? 🙏

Member

@starbops starbops left a comment


Thanks for the PR, @davidepasquero. However, we're more inclined to introduce the leader election mechanism, which is the norm in Kubernetes and is widely adopted by custom controllers across the Harvester ecosystem. We will transition the agents from pods to deployments and utilize the Kubernetes deployment controller to handle the heavy lifting of managing the pod lifecycle. This will address the situation of a node down incident, which prevents agent pods from being rescheduled and run on other healthy nodes until human intervention, i.e., manually deleting the agent pods stuck in the terminating state.

@davidepasquero
Author

davidepasquero commented Jul 4, 2025

Thank you very much for the detailed feedback and the clear direction suggested.

I have carefully read the comment and I fully agree that introducing a leader election mechanism and transitioning agents from Pod to Deployment is a much more robust and standard approach for the Kubernetes and Harvester ecosystem.

I fully understand how using the Kubernetes Deployment controller can natively and more effectively manage the lifecycle of the pods, solving the root problem of agents stuck on a node that is no longer reachable.

I propose a working implementation in which I transformed the agents into Deployments and implemented leader election. All the details are reported in the new PR #61, which I ask you to review. Thanks.

@starbops
Member

starbops commented Jul 9, 2025

@davidepasquero Thank you. I left some comments in #61, PTAL. Regarding the successor PR, I will close this one as obsolete. Feel free to reopen if you believe it's still relevant. Thank you.

@starbops starbops closed this Jul 9, 2025
@davidepasquero
Author

davidepasquero commented Jul 9, 2025 via email
