Conversation

phuhung273
Contributor

@phuhung273 phuhung273 commented Apr 18, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:
Support EKS upgrade policy

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5183

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • includes emoji in title
  • adds unit tests
  • adds or updates e2e tests

Release note:

Support EKS upgrade policy

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 18, 2025
@k8s-ci-robot
Contributor

Welcome @phuhung273!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-aws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-aws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 18, 2025
@k8s-ci-robot
Contributor

Hi @phuhung273. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@phuhung273
Contributor Author

Hi @richardcase, could you please review this PR when you have time?

@richardcase
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 15, 2025
Member

@richardcase richardcase left a comment

Thanks for this @phuhung273 .

Feel free to ping me on the issue or in slack if you want a hand with anything.

// The default value is EXTENDED. Use STANDARD to disable extended support.
// +kubebuilder:validation:Enum=EXTENDED;STANDARD
// +optional
UpgradePolicy *string `json:"upgradePolicy,omitempty"`
Member

We generally use a string type alias for something like this, with a variable defined for each supported value. Like this: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/controlplane/eks/api/v1beta2/types.go#L147-L157.

Also, the actual string values are normally lowercase, so extended instead of EXTENDED. This does mean that some conversion is needed when creating the AWS request, like this for addons: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/eks/addons.go#L212

Contributor Author

Thank you, I didn't know this. Updated to use a string type alias.
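
For readers following along, the pattern described above could look roughly like the sketch below. It is illustrative only, not the code from this PR or from the linked types.go: the constant names and the `wireValue` helper are assumptions, and the upper-casing stands in for whatever conversion the AWS request actually needs.

```go
package main

import (
	"fmt"
	"strings"
)

// UpgradePolicy is a string-based type for the EKS support policy
// (hypothetical names, mirroring the style the reviewer describes).
type UpgradePolicy string

const (
	// Lowercase values are what users would write in the API object.
	UpgradePolicyExtended UpgradePolicy = "extended"
	UpgradePolicyStandard UpgradePolicy = "standard"
)

// String returns the lowercase API value.
func (p UpgradePolicy) String() string { return string(p) }

// wireValue upper-cases the API value, standing in for the conversion
// needed when building the AWS request (assumed helper, not from the PR).
func wireValue(p UpgradePolicy) string {
	return strings.ToUpper(p.String())
}

func main() {
	for _, p := range []UpgradePolicy{UpgradePolicyExtended, UpgradePolicyStandard} {
		fmt.Printf("%s -> %s\n", p, wireValue(p))
	}
}
```

Keeping the API values lowercase and converting at the boundary keeps the user-facing spec consistent with the other fields the reviewer links to.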

}

if err := wait.WaitForWithRetryable(wait.NewBackoff(), func() (bool, error) {
if _, err := s.EKSClient.UpdateClusterConfig(input); err != nil {
Member

There is already an UpdateClusterConfig call in the reconciler; it would be better to use that instead of adding a new, separate update. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/eks/cluster.go#L543

Contributor Author

Sure, updated to use the existing UpdateClusterConfig call.

@phuhung273 phuhung273 force-pushed the eks-upgrade-policy branch from 24f8b66 to 353e345 Compare May 17, 2025 10:54
@phuhung273 phuhung273 requested a review from richardcase May 17, 2025 16:13
@phuhung273
Contributor Author

Hi @richardcase, if you have time, please help me review the latest update

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2025
@phuhung273 phuhung273 force-pushed the eks-upgrade-policy branch from 4b49e38 to 00efef1 Compare June 15, 2025 11:19
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 15, 2025
@phuhung273
Contributor Author

Hi @richardcase, can you help me review this?

@phuhung273
Contributor Author

Hi @richardcase, can you help review the changes addressing your comments?

Comment on lines 213 to 219
// The support policy to use for the cluster.
// Extended support indicates that the cluster will not be automatically upgraded
// when it leaves the standard support period, and will enter extended support.
// Clusters in extended support have higher costs.
// The default value is extended. Use standard to disable extended support.
// +kubebuilder:validation:Enum=extended;standard
// +optional
UpgradePolicy *UpgradePolicy `json:"upgradePolicy,omitempty"`
Member

Does it have to be a pointer? Since it is marked as optional, it will just be empty if we omit it, right?
In that case we can default it to extended.

Member

Yes, agreed. As this is a type alias for a string, we can do without the pointer.

Contributor Author

@phuhung273 phuhung273 Sep 4, 2025

Consider a cluster that is currently STANDARD, e.g. set through a workaround because CAPA doesn't support the field yet. Defaulting to EXTENDED would suddenly switch all such clusters to EXTENDED, increasing the cost.

There is a test case for this situation: if the field is nil, don't do anything. https://github.com/phuhung273/cluster-api-provider-aws/blob/00efef15c7cb5263a6409d2b998f8583d5e0d1bd/pkg/cloud/services/eks/cluster_test.go#L703-L710

Do you think this is a valid case?

Member

What is the default in the SDK for EKS if no policy is specified at the moment? If the default is "standard", then we should keep standard as the default here, IMO. That wouldn't cause the issue you are talking about.

Contributor Author

@phuhung273 phuhung273 Sep 4, 2025

By default:

  • CreateCluster uses EXTENDED.
  • UpdateCluster doesn't touch that parameter if it is not specified.

The case I'm describing is when users have other tooling managing clusters on top of CAPA.

Member

Yes, I agree. In that case we can add a comment that explains how CAPA will behave when the user omits the value or sets it to "",
i.e. it will let the AWS platform choose whatever the default upgrade policy is (Extended at the time of writing, but this may change in the future).

Contributor Author

Seems like we have an agreement on: Extended, Standard and omitted (""). Thank you both.

I think for new clusters we should explicitly state what the default is in our API.

Regarding this point: should we explicitly set EXTENDED in createCluster if the user omits it?

Member

@damdo damdo Sep 5, 2025

Regarding this point: should we explicitly set EXTENDED in createCluster if the user omits it?

We should omit setting it in createCluster if the user omits it.

Contributor Author

Thanks for the confirmation. Let me update the PR.

Contributor Author

I've updated to: Extended, Standard or omitted (""). PTAL @damdo @richardcase, thanks.
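
The behaviour agreed in this thread can be summarised in a small sketch. It is an illustration under assumptions rather than the PR's implementation: `desiredSupportType` and `needsUpdate` are hypothetical helper names, the lowercase constants follow the enum marker quoted above, and plain strings stand in for the AWS SDK's support type values.

```go
package main

import "fmt"

// UpgradePolicy is a non-pointer string type; "" means the user omitted it.
type UpgradePolicy string

const (
	UpgradePolicyExtended UpgradePolicy = "extended"
	UpgradePolicyStandard UpgradePolicy = "standard"
)

// desiredSupportType returns the value to send to AWS and whether anything
// should be sent at all. An omitted policy is never sent, so the AWS
// platform default applies on create and nothing changes on update.
func desiredSupportType(spec UpgradePolicy) (string, bool) {
	switch spec {
	case UpgradePolicyExtended:
		return "EXTENDED", true
	case UpgradePolicyStandard:
		return "STANDARD", true
	default:
		return "", false
	}
}

// needsUpdate reports whether an existing cluster's policy should change,
// given the current value reported by AWS.
func needsUpdate(spec UpgradePolicy, current string) (string, bool) {
	want, set := desiredSupportType(spec)
	if !set || want == current {
		return "", false
	}
	return want, true
}

func main() {
	// A cluster switched to STANDARD outside CAPA is left alone when the
	// field is omitted.
	want, changed := needsUpdate("", "STANDARD")
	fmt.Printf("omitted:  want=%q changed=%v\n", want, changed)

	// An explicit value that differs from the current one triggers an update.
	want, changed = needsUpdate(UpgradePolicyStandard, "EXTENDED")
	fmt.Printf("standard: want=%q changed=%v\n", want, changed)
}
```

Treating the empty string as "do nothing" is what avoids the cost concern raised earlier: clusters managed partly outside CAPA are not silently moved to extended support.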

Member

@richardcase richardcase left a comment

Looks good @phuhung273 .

It would also be worth adding a line to the docs to state that by default we create a cluster with "extended" support, perhaps somewhere here: https://cluster-api-aws.sigs.k8s.io/topics/eks/creating-a-cluster

return true, nil
}

func convertUpgradePolicy(input ekscontrolplanev1.UpgradePolicy) ekstypes.SupportType {
Member

We sometimes put these in the "convert" package when converting to/from SDK types.

Contributor Author

Sure, I can see a similar converter package. I moved this function there.

@phuhung273 phuhung273 force-pushed the eks-upgrade-policy branch 3 times, most recently from b5b2dc7 to 12b6056 Compare September 4, 2025 17:20
@phuhung273
Contributor Author

/test pull-cluster-api-provider-aws-e2e-blocking

@phuhung273
Contributor Author

/test pull-cluster-api-provider-aws-e2e

@phuhung273
Contributor Author

/test pull-cluster-api-provider-aws-e2e

@phuhung273
Contributor Author

/test pull-cluster-api-provider-aws-e2e

@phuhung273
Contributor Author

Looks like a test infra issue.

@phuhung273
Contributor Author

/test pull-cluster-api-provider-aws-e2e

@damdo
Member

damdo commented Sep 25, 2025

/test pull-cluster-api-provider-aws-e2e

@phuhung273
Contributor Author

/test pull-cluster-api-provider-aws-e2e

@damdo
Member

damdo commented Sep 25, 2025

@phuhung273
Contributor Author

phuhung273 commented Sep 25, 2025

Context for that error here: https://kubernetes.slack.com/archives/CD6U2V71N/p1758795545213209

This one is a spot instance test, so I'm not surprised if it is flaky 😅 So I think it makes sense to retry.

But this run https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5471/pull-cluster-api-provider-aws-e2e/1971153449110212608 failed 3 other cases, while the spot one you're referring to passed.

@damdo
Member

damdo commented Sep 25, 2025

Context for that error here: https://kubernetes.slack.com/archives/CD6U2V71N/p1758795545213209

This one is a spot instance test, so I'm not surprised if it is flaky 😅 So I think it makes sense to retry.

But this run https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5471/pull-cluster-api-provider-aws-e2e/1971153449110212608 failed 3 other cases, while the spot one you're referring to passed.

The underlying issue is the same one. If we fix that, we fix all of them. Let's keep chatting on that thread.

@damdo
Member

damdo commented Sep 26, 2025

/test pull-cluster-api-provider-aws-e2e

@phuhung273
Contributor Author

Just rebased to see if things get better.

@phuhung273
Contributor Author

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@k8s-ci-robot
Contributor

@phuhung273: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-e2e 52f04d0 link false /test pull-cluster-api-provider-aws-e2e
pull-cluster-api-provider-aws-e2e-eks 52f04d0 link false /test pull-cluster-api-provider-aws-e2e-eks

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/feature Categorizes issue or PR as related to a new feature. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support EKS upgrade policy
7 participants