MachinePool: avoid SetNotReady during normal processing #5537

mweibel · 2025-04-02T12:58:18Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
Adjusts setting of Ready state of the AzureMachinePool based on provisioningState. Most of the provisioningStates do not prohibit scaling of the VMSS. When ready is set to false however, the CAPI MachinePool does not get reconciled anymore and e.g. the providerIDList is not updated. This has the effect of not doing processing of existing/new Machines.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4982

TODOs:

squashed commits
includes documentation
adds unit tests
cherry-pick candidate

Release note:

fixes ready state of AzureMachinePool to avoid failing to reconcile a working VMSS. This improves reconciliation of MachinePools especially during scale up/down operations.

k8s-ci-robot · 2025-04-02T12:58:28Z

Hi @mweibel. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

codecov · 2025-04-02T13:08:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.05%. Comparing base (8a1f692) to head (0fb07b3).
Report is 91 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5537      +/-   ##
==========================================
+ Coverage   52.94%   53.05%   +0.11%     
==========================================
  Files         272      272              
  Lines       29525    29526       +1     
==========================================
+ Hits        15631    15665      +34     
+ Misses      13087    13054      -33     
  Partials      807      807

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

mweibel · 2025-04-09T06:32:51Z

FYI I removed the draft state. We're running this since a couple of days in production and it works well so far.

willie-yao · 2025-04-09T20:36:15Z

/assign
/ok-to-test

willie-yao

This looks like a great improvement to me, apologies for the delays here!

/lgtm
/assign @nawazkh

k8s-ci-robot · 2025-04-14T18:26:54Z

LGTM label has been added.

Git tree hash: 6b8935af5490192b4ecb7bb6d8ebf881247a7fb5

willie-yao · 2025-04-14T18:28:13Z

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-workload-upgrade

nawazkh

Great work here @mweibel !
I think we should also support this change with an e2e test, what do you say @mweibel ?

nawazkh · 2025-04-14T20:23:23Z

azure/scope/machinepool.go

+	case v == infrav1.Failed:
+		conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetRunningCondition, infrav1.ScaleSetProvisionFailedReason, clusterv1.ConditionSeverityInfo, "")


Shouldn't we be marking the resource as NotReady when the resource's Provisioning state is infrav1.Failed ?

from what I gathered in my experiments, the VMSS marks itself as failed if one VM has failed provisoning state. If we mark the VMSS in this case as failed, further reconciliation is prevented until that one VM is either removed from the VMSS, or the VM goes into state succeeded again.

Because a VMSS can still scale up and down even if provisioningState is failed, it's wrong to mark the whole AMP as failed and prevent reconciling.

I explicitely didn't add m.SetReady() here because in case the VMSS is already in NotReady state due to something else, we shouldn't reset that flag when the provisioningState is failed.

Self notes:
I agree that a VMSS can still scale up and down even if provisioningState is failed and marking it as failed is incorrect.
But is setting an AMP's status to NotReady equivalent to marking it as failed ?
Will probe..

Also, in the scenario AzureMachinePool.Status.ProvisioningState == infrav1.Failed shouldn't we be setting the ready status to m.SetReady() so that the reconciler does not carry over the earlier state?

re last comment: yeah this is where I'm not 100% sure. I didn't mark it as Ready here because if we have a failed state before we may not want to mark the AMP as ready. It's however related to what you mentioned in another comment: the state handling is not good enough. However I couldn't yet see a way to reliably determine if a VMSS is unable to scale.

nawazkh · 2025-04-14T20:30:38Z

azure/scope/machinepool.go

@@ -615,19 +615,20 @@ func (m *MachinePoolScope) setProvisioningStateAndConditions(v infrav1.Provision
 		} else {
 			conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetDesiredReplicasCondition, infrav1.ScaleSetScaleDownReason, clusterv1.ConditionSeverityInfo, "")
 		}
-		m.SetNotReady()
+		m.SetReady()


This change is probably evolving from the observation listed in #5515, i.e.

We are observing issues with a customer that sees NICs sometimes enter a ProvisioningFailed state yet continue operating, which then cascades to prevent any further action on the dependent resources, such as the virtual machines.

Shouldn't the right approach in addressing this be something along the lines of below ?

These states are metadata properties of the resource. They're independent from the functionality of the resource itself. Being in the failed state doesn't necessarily mean that the resource isn't functional. In most cases, it can continue operating and serving traffic without issues.

In several scenarios, if the resource is in the failed state, further operations on the resource or on other resources that depend on it might fail. You need to revert the state back to succeeded before running other operations.
...
To restore succeeded state, run another write (PUT) operation on the resource.

The issue that caused the previous operation might no longer be current. The newer write operation should be successful and restore the provisioning state.

Reference: #5515

I saw the referenced issue but didn't investigate if the underlying issue is similar. From my understanding (what I wrote in the other comment reply), a VMSS might be different in that it makes it's provisioningState dependant on the provisioningStates of the running VMs/Instances in it.
Marking the VMSS as not ready has many implications, most notably that providerIdList is not processed anymore and therefore having dangling VMs lying around which don't get into CAPI/CAPZ at all.

a VMSS might be different in that it makes it's provisioningState dependant on the provisioningStates of the running VMs/Instances in it.
Marking the VMSS as not ready has many implications, most notably that providerIdList is not processed anymore and therefore having dangling VMs lying around which don't get into CAPI/CAPZ at all.

Thank you for providing context on this.

As a users, it seems quite odd to me that the CAPZ controller would mark the AzureMachinePool's Status as Ready when

*m.MachinePool.Spec.Replicas != m.AzureMachinePool.Status.Replicas

infrav1.ProvisioningState == infrav1.Updating

If I were to stretch my understanding, it is sort of acceptable for AzureMachinePool to mark itself ready when infrav1.ProvisioningState == infrav1.Succeeded and the *m.MachinePool.Spec.Replicas != m.AzureMachinePool.Status.Replicas. This implies to me that Azure has acknowledged AzureMachinePools request(hence infrav1.ProvisioningState == infrav1.Succeeded) and is working to get the desired replicas.

However, when infrav1.ProvisioningState == infrav1.Updating, it does not seem right to be broadcasting that AzureMachinePool is Ready when Azure is clearly working to get the VMs assigned and added to the VMSS.

Is my understanding wrong here ?

most notably that providerIdList is not processed anymore and therefore having dangling VMs lying around which don't get into CAPI/CAPZ at all.

I need to probe this further, but relying on its own (AMP) status to progress in its reconciliation logic is not a good pattern. It will only lead to more dependence on its own status.

I think I outlined what happens when the AMP is not ready in the issue here: #4982 (comment)

see also the code from CAPI: https://github.com/kubernetes-sigs/cluster-api/blob/8d639f1fad564eecf5bda0a2ee03c8a38896a184/exp/internal/controllers/machinepool_controller_phases.go#L290-L319

From what I understand:

The .Status.Ready is used by CAPI to determine if it should be reconciled.

If not ready the whole MP does not get reconciled anymore

The new v1beta2 status make it much more clear (Initialization status).

This change here ensures that we don't avoid processing the AMP/MP with the current API version.

As a users, it seems quite odd to me that the CAPZ controller would mark the AzureMachinePool's Status as Ready when

I always read the AMP Ready status as: It's possible to work with the AMP, i.e. scale up and down is possible. Which is the case, even if with a Replica mismatch or a updating provisioningState.

The AMP Ready status is read by the CAPI MP controller. Therefore it has implications broader than just for the CAPZ controller. I think the Replicas difference or the ProvisioningState difference should not be reflected in the Ready flag but instead with conditions.

Does that make sense or do I see something not?

Sorry for the delay on getting back on this.

The AMP Ready status is read by the CAPI MP controller. Therefore it has implications broader than just for the CAPZ controller.

That makes sense. It is not fault-tolerant for a resource controller to determine next steps based on its own status, rather than spec.

I think the Replicas difference or the ProvisioningState difference should not be reflected in the Ready flag but instead with conditions.

I agree with this.

The logic to update the status of a MachinePool needs to be revisited in my opinion. It needs to be better. Sorry to push back on this so much.

The area of controversy to me is that the way we are updating the status of MachinePool even when it is in Updating state or if the total replicas requested is more than current.
For instance, is it valid to update the status of AzureMachinePool to Ready when the actual number of replicas is 0, but the desired replica count is greater than 1 ? Meaning, when the Azure Machine Pool is still spinning up ?

@nawazkh I totally see your point but I think the ready status as it was before this change was mis-interpreted though it's not exactly clear in the contract.

Looking at related paragraph in the contract:

Required status fields

The status object must have at least one field defined:

ready - a boolean field indicating if the infrastructure is ready to be used or not.

My interpretation of this: As soon as the MachinePool got its first node running (and therefore validated it works), the ready state should go to true. It should stay true as long as the MachinePool is able to scale.

The provisioningState of a VMSS is influenced by the provisioningState of each VM inside. If one of them is in Failed or Updating state, the VMSS also reflects that in it's own provisioningState. It however doesn't mean that the VMSS is unable to scale or function. When we mark the AzureMachinePool as not ready, this marks the MachinePool as unable to scale.

I do agree about the need to rethink the status handling though. We don't really have a clear way of knowing when a VMSS is not ready to scale.

mweibel · 2025-04-22T11:14:21Z

Great work here @mweibel ! I think we should also support this change with an e2e test, what do you say @mweibel ?

@nawazkh yeah I thought about that too. I do wonder how we could e2e test that however. We'd need a way to simulate a failing provisioning state to fully e2e test this, or did you have another idea?

mweibel · 2025-04-23T06:57:13Z

/retest

nawazkh · 2025-04-23T22:23:30Z

Great work here @mweibel ! I think we should also support this change with an e2e test, what do you say @mweibel ?

@nawazkh yeah I thought about that too. I do wonder how we could e2e test that however. We'd need a way to simulate a failing provisioning state to fully e2e test this, or did you have another idea?

Mocking would be a great alternative to having a full blown e2e and simulating a failing VMSS; especially since Azure doesn't really let the users change the status of a resource as per their wish. And I think your unit test is achieving the required test scenarios. Thank you for adding those tests 😃

nawazkh · 2025-04-24T21:40:54Z

/test pull-cluster-api-provider-azure-e2e-optional

mboersma · 2025-04-30T14:23:15Z

/test pull-cluster-api-provider-azure-apiversion-upgrade

nawazkh · 2025-04-30T20:07:52Z

/test pull-cluster-api-provider-azure-e2e-optional

mweibel · 2025-05-05T09:19:37Z

is there anything else I can do to make this move along?

nawazkh · 2025-05-06T17:12:27Z

azure/scope/machinepool.go

 	case v == infrav1.Updating:
 		conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetModelUpdatedCondition, infrav1.ScaleSetModelOutOfDateReason, clusterv1.ConditionSeverityInfo, "")
-		m.SetNotReady()
+		m.SetReady()


Could you please add a comment on rational behind changing this behavior ?

see my other comment - if one VM is in Updating state, the provisioningState of the VMSS is also Updating. This doesn't mean the VMSS is unable to scale though.

nawazkh · 2025-05-06T17:16:56Z

is there anything else I can do to make this move along?

Sorry for the delay on getting back to this PR. Added some final questions on this PR. Please take a look!

nawazkh · 2025-06-02T16:28:13Z

/lgtm
^ from my side.
@willie-yao, @mboersma or @jackfrancis : please give this another review.

k8s-ci-robot · 2025-06-02T16:28:19Z

LGTM label has been added.

Git tree hash: a23cf13b62fe6892b3546e65dc7b159d9b244147

willie-yao · 2025-06-02T16:56:10Z

/lgtm
/approve

k8s-ci-robot · 2025-06-02T16:56:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: willie-yao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [willie-yao]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

willie-yao · 2025-06-02T18:57:37Z

/retest

mweibel · 2025-06-03T06:21:53Z

/retest

willie-yao · 2025-06-03T22:18:39Z

/retest

mweibel · 2025-06-04T08:27:11Z

/retest

are they flaky recently?

mboersma

/override pull-cluster-api-provider-azure-apiversion-upgrade

We hope this test will be fixed by #5528 (see comment).

k8s-ci-robot · 2025-06-04T15:49:05Z

@mboersma: Overrode contexts on behalf of mboersma: pull-cluster-api-provider-azure-apiversion-upgrade

In response to this:

/override pull-cluster-api-provider-azure-apiversion-upgrade

We hope this test will be fixed by #5528 (see comment).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

github-project-automation bot added this to CAPZ Planning Apr 2, 2025

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 2, 2025

github-project-automation bot moved this to Todo in CAPZ Planning Apr 2, 2025

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 2, 2025

k8s-ci-robot requested review from jackfrancis and mboersma April 2, 2025 12:58

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 2, 2025

mweibel force-pushed the fix-mp-ready branch 2 times, most recently from 861d905 to f424888 Compare April 4, 2025 12:37

mweibel marked this pull request as ready for review April 9, 2025 06:32

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 9, 2025

k8s-ci-robot requested a review from willie-yao April 9, 2025 06:32

k8s-ci-robot assigned willie-yao Apr 9, 2025

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025

willie-yao approved these changes Apr 14, 2025

View reviewed changes

k8s-ci-robot assigned nawazkh Apr 14, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 14, 2025

willie-yao moved this from Todo to Needs Review in CAPZ Planning Apr 14, 2025

nawazkh reviewed Apr 14, 2025

View reviewed changes

MachinePool: avoid SetNotReady during normal processing

0fb07b3

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2025

k8s-ci-robot requested review from nawazkh and willie-yao April 22, 2025 07:01

nawazkh mentioned this pull request Apr 23, 2025

AzureMachinePoolMachine: avoid setting FailureReason/FailureMessage #5558

Merged

4 tasks

kubernetes-sigs deleted a comment from k8s-ci-robot Apr 30, 2025

nawazkh reviewed May 6, 2025

View reviewed changes

nawazkh moved this from Needs Review to Wait-On-Author in CAPZ Planning May 8, 2025

nawazkh added this to the v1.20 milestone May 8, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 2, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2025

mboersma approved these changes Jun 4, 2025

View reviewed changes

k8s-ci-robot merged commit ac1868f into kubernetes-sigs:main Jun 4, 2025
24 checks passed

github-project-automation bot moved this from Wait-On-Author to Done in CAPZ Planning Jun 4, 2025

		case v == infrav1.Failed:
		conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetRunningCondition, infrav1.ScaleSetProvisionFailedReason, clusterv1.ConditionSeverityInfo, "")

MachinePool: avoid SetNotReady during normal processing #5537

MachinePool: avoid SetNotReady during normal processing #5537

Conversation

mweibel commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Apr 2, 2025

Uh oh!

codecov bot commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mweibel commented Apr 9, 2025

Uh oh!

willie-yao commented Apr 9, 2025

Uh oh!

willie-yao left a comment

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented Apr 14, 2025

Uh oh!

willie-yao commented Apr 14, 2025

Uh oh!

nawazkh left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mweibel Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mweibel Apr 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mweibel commented Apr 22, 2025

Uh oh!

mweibel commented Apr 23, 2025

Uh oh!

nawazkh commented Apr 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nawazkh commented Apr 24, 2025

Uh oh!

mboersma commented Apr 30, 2025

Uh oh!

nawazkh commented Apr 30, 2025

Uh oh!

mweibel commented May 5, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nawazkh commented May 6, 2025

Uh oh!

nawazkh commented Jun 2, 2025

mweibel commented Apr 2, 2025 •

edited

Loading

codecov bot commented Apr 2, 2025 •

edited

Loading

mweibel Apr 22, 2025 •

edited

Loading

mweibel Apr 25, 2025 •

edited

Loading

nawazkh commented Apr 23, 2025 •

edited

Loading