Skip to content

Conversation

@zhanggbj
Copy link
Contributor

@zhanggbj zhanggbj commented Oct 28, 2025

What this PR does / why we need it:
Support Node Auto Placement and Node AF/AAF

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

zhanggbj and others added 11 commits October 22, 2025 14:19
- Bump VMOP including Node AF/AAF support
- Add NodeAutoPlacement Feature Gate

(cherry picked from commit 700c8ae)
Removes the extra cases for VMG creation, such that VMG is created for:
1. Multiple zones, multiple MDs with no failureDomain
2. Multiple zones, multiple MDs with failureDomain
3. Single zone, existing cluster with no failureDomain MDs

Signed-off-by: Sagar Muchhal <[email protected]>
- Updates VMOP API dependency

Misc VMG fixes
- Use namingStrategy to calculate VM names
- Use MachineDeployment names for VMG placement label
- Includes all machinedeployments to generate node-pool -> zone
  mapping

Fixes VMG webhook validation error
- Adds cluster-name label to Af/AAF spec
- re-adds zone topology key back to anti-aff spec

Signed-off-by: Sagar Muchhal <[email protected]>
…gs#71)

* Refine VMG controller when generate per-MD zone labels

- Skip legacy already-placed VM which do not have placement info
- Skip VM which do not have zone info

* Apply suggestions from code review

---------

Co-authored-by: Sagar Muchhal <[email protected]>
- Sync VSphereMachines during day-2 operations in VMG controller
- Only wait for all intended VSphereMachines during initial Cluster creation
- Use annotations in VMG for per-md-zone info

Signed-off-by: Gong Zhang <[email protected]>
- Add VMG recociler unit test
- Bump VMOP due to API change
- Filter out VSphereMachine event except create/delete events

Signed-off-by: Gong Zhang <[email protected]>
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 28, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 28, 2025
@zhanggbj zhanggbj changed the title [WIP][DoNotReview]Support Node Auto Placement and Node AF/AAF [WIP][DoNotReview]✨ Support Node Auto Placement and Node AF/AAF Oct 28, 2025
@zhanggbj zhanggbj force-pushed the node_auto_placement branch 3 times, most recently from 1160ce9 to 1cd61f9 Compare October 28, 2025 10:31
@zhanggbj
Copy link
Contributor Author

/test ?

@k8s-ci-robot
Copy link
Contributor

@zhanggbj: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main
/test pull-cluster-api-provider-vsphere-test-main
/test pull-cluster-api-provider-vsphere-verify-main

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-vsphere-apidiff-main
/test pull-cluster-api-provider-vsphere-janitor-main

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-vsphere-apidiff-main
pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main
pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main
pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main
pull-cluster-api-provider-vsphere-test-main
pull-cluster-api-provider-vsphere-verify-main

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zhanggbj
Copy link
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main
/test pull-cluster-api-provider-vsphere-test-main
/test pull-cluster-api-provider-vsphere-verify-main
/test pull-cluster-api-provider-vsphere-apidiff-main
/test pull-cluster-api-provider-vsphere-janitor-main

@zhanggbj zhanggbj force-pushed the node_auto_placement branch 7 times, most recently from 22604bf to 7c7c1a7 Compare November 3, 2025 08:55
@zhanggbj zhanggbj force-pushed the node_auto_placement branch from ea432ed to dc441f6 Compare November 21, 2025 13:29
Signed-off-by: Gong Zhang <[email protected]>
Copy link
Contributor Author

@zhanggbj zhanggbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All comments are addressed, including the ones missed by mistake last time.

zhanggbj

This comment was marked as spam.

@zhanggbj zhanggbj force-pushed the node_auto_placement branch 2 times, most recently from 8dabe39 to 19ebccb Compare November 23, 2025 13:27
Signed-off-by: Gong Zhang <[email protected]>
@zhanggbj zhanggbj force-pushed the node_auto_placement branch from 19ebccb to b981168 Compare November 23, 2025 14:16
@zhanggbj
Copy link
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main

@zhanggbj zhanggbj force-pushed the node_auto_placement branch 3 times, most recently from 42a6d73 to 3ed2e93 Compare November 24, 2025 14:46
@zhanggbj
Copy link
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main

@zhanggbj
Copy link
Contributor Author

Hey @fabriziopandini @sbueringer ,
All comments are addressed, including the ones missed by mistake last time. Now CI are green. Please help to take another look, really appreciate!

For the concern about race conditions, I have implemented checks to mitigate race conditions during initial and post-placement, and also have unit teststo cover the scenarios.

@zhanggbj zhanggbj force-pushed the node_auto_placement branch from 3ed2e93 to 9b1754a Compare November 25, 2025 06:44
@zhanggbj
Copy link
Contributor Author

Pushed a small change and checking UT failures.

@zhanggbj zhanggbj force-pushed the node_auto_placement branch from 9b1754a to 2ac47d4 Compare November 25, 2025 09:25
@zhanggbj
Copy link
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main

@zhanggbj
Copy link
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main

Since Encryption Class requires API in this newer version, bump vm-operator package

Signed-off-by: Gong Zhang <[email protected]>
@zhanggbj zhanggbj force-pushed the node_auto_placement branch from fd72f09 to c41abeb Compare November 26, 2025 07:49
@zhanggbj zhanggbj force-pushed the node_auto_placement branch from b8c569e to 08b8304 Compare November 27, 2025 09:23
continue
}
if member.Placement != nil && member.Placement.Zone != "" {
log.Info(fmt.Sprintf("MachineDeployment %s has been placed to failure domanin %s", md.Name, member.Placement.Zone), "MachineDeployment", klog.KObj(&md))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
log.Info(fmt.Sprintf("MachineDeployment %s has been placed to failure domanin %s", md.Name, member.Placement.Zone), "MachineDeployment", klog.KObj(&md))
log.Info(fmt.Sprintf("MachineDeployment %s has been placed to failure domain %s", md.Name, member.Placement.Zone), "MachineDeployment", klog.KObj(&md))

want bool
}{
{
name: "Should create a VMG if all the expected VSphereMachines exists",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a unit test ensuring that a deleting vSphereMachine blocks.
I would suggest to use this test case as a setup and add an additional deleting machine

clusterv1.ClusterNameLabel: cluster.Name,
},
Annotations: map[string]string{
fmt.Sprintf("%s/%s", vmoperator.ZoneAnnotationPrefix, "md3"): "zone1", // failureDomain for md3 is explicitly set by the user
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add an annotation without the ZoneAnnotationPrefix and check it is preserved

Namespace: cluster.Namespace,
Name: cluster.Name,
UID: types.UID("uid"),
Labels: map[string]string{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add an label different than the ClusterNameLabel and check it is preserved

Reason: infrav1.VSphereMachineVirtualMachineWaitingForVirtualMachineGroupV1Beta2Reason,
Message: fmt.Sprintf("Waiting for VSphereMachine's VirtualMachineGroup %s to exist", key),
})
log.V(4).Info(fmt.Sprintf("Waiting for VirtualMachineGroup %s, requeueing", key.Name), "VirtualMachineGroup", klog.KRef(key.Namespace, key.Name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.V(4).Info(fmt.Sprintf("Waiting for VirtualMachineGroup %s, requeueing", key.Name), "VirtualMachineGroup", klog.KRef(key.Namespace, key.Name))
log.V(4).Info(fmt.Sprintf("Waiting for VirtualMachineGroup %s to exist, requeueing", key.Name), "VirtualMachineGroup", klog.KRef(key.Namespace, key.Name))

}

// Set the zone label using the annotation of the per-md zone mapping from VirtualMachineGroup.
// This is for new VMs created during day-2 operations when Node Auto Placement is enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This is for new VMs created during day-2 operations when Node Auto Placement is enabled.
// This is for new VMs created after initial placement decision/with a failureDomain defined by the user.

client.MatchingLabels{clusterv1.ClusterNameLabel: supervisorMachineCtx.Cluster.Name}); err != nil {
return false, err
}
othermMDNames := []string{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
othermMDNames := []string{}
otherMDNames := []string{}

})

// control plane machine is the machine with the control plane label set
Specify("Reconcile valid control plane Machine", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a check that the CP machine is placed in the expected zone

vmGroup *vmoprv1.VirtualMachineGroup
)

BeforeEach(func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are missing a test for VMG not created yet


machineNotMember.Spec.Bootstrap.DataSecretName = &secretName

By("VirtualMachine is not created")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check the VM actually does not exist, not only result

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants