feat: support for ultrassd by PabloTriv · Pull Request #1704 · Azure/karpenter-provider-azure

PabloTriv · 2026-06-02T14:09:12Z

Fixes #

Description
This change introduces support for Ultra SSD. A design doc covering the implementation in more detail is included. Ultra SSD will be enabled or disabled via a setting on the AKSNodeClass. The implementation should stay as close to AKS behavior as possible.

How was this change tested?
New E2E: https://github.com/Azure/karpenter-provider-azure/actions/runs/27987096466/job/82831071878

Does this change impact docs?

Yes, PR includes docs updates
Yes, issue opened: #
No

Release Note

rakechill

My main concern (reflected in a few different comments) is whether we should treat the MachineAPI/ManagedCluster state or the AKSNodeClass fields as the source of truth for a NAP cluster.

Need a few things confirmed:

Is this only set during cluster creation? or can it be enabled/disabled through the cluster lifecycle?
Is it set at the cluster level? agentpool level? or both?
Will/can/should Machine API report this as a DriftAction when the MC field for this changes?

If this is only enabled at cluster creation for AKS, I think this should instead be set via helm values -> env vars/flags -> opts.

If this can be disabled/enabled after cluster creation, I think this nodeclass field should be limited to self-hosted and machineapi should handle setting this for machines based on MC object.

PabloTriv · 2026-06-11T19:13:47Z

Is this only set during cluster creation? or can it be enabled/disabled through the cluster lifecycle?

This is set during cluster creation and nodepool creation. It can't be changed once a pool has been created. Enabling at the cluster means the system pool gets the UltraSSD capabilities, not that all nodepools/nodes in the cluster will have the functionality by default.

Is it set at the cluster level? agentpool level? or both?

It's agentpool level. When you pass it during cluster create you're just enabling it on the initial create request. And other nodepools you add need to opt in. It's also immutable.

Will/can/should Machine API report this as a DriftAction when the MC field for this changes?

The field can't change once the cluster/pool is created, so I don't think we'd have to worry about this.

If this is only enabled at cluster creation for AKS, I think this should instead be set via helm values -> env vars/flags -> opts.

I think this would make sense if it was a global field that controlled all nodepools. But today each pool needs to opt into it, which is why I think it would make sense to stick it into either the NodePool or AKSNodeClass CRDs. Clusters can have a mix of ultrassd enabled/disabled pools on the same cluster.

rakechill · 2026-06-11T19:50:32Z

...I think this would make sense if it was a global field that controlled all nodepools. But today each pool needs to opt into it, which is why I think it would make sense to stick it into either the NodePool or AKSNodeClass CRDs. Clusters can have a mix of ultrassd enabled/disabled pools on the same cluster.

I see, didn't realize that enabling for the cluster == enabling for the system nodepool. In that case, it makes sense to me for this to be something that gets enabled from the Karpenter side. Will customers be able to change this after the fact then? That would be different behavior than what AKS currently provides.

I'd expect there to be some validation if a customer tries to switch from Enabled to Disabled.

Or, if we don't plan to mimic RP behavior, can we get clarity on why RP doesn't allow disablement after enablement?

rakechill

Your comments from the previous review make sense + I better understand the agentpool<>cluster relationship wrt this feature.

A few remaining thoughts:

Is there any way to make this per-nodepool? I expect many clusters will have different Nodepools for different use cases + may prefer that level of granularity. Maybe checking with @wdarko1 based on current customer requirements could be useful.
Are the other features we have on AKSNodeClass also configurable per-agentpool in AKS? Or are they cluster-wide?
Should we block disablement after enablement to align with current AKS behavior?
Should we only allow ultra SSD to be enabled in Karpenter if it's enabled on the cluster? Basically introducing a validation here based on cluster state.

PabloTriv · 2026-06-12T22:17:45Z

Is there any way to make this per-nodepool? I expect many clusters will have different Nodepools for different use cases + may prefer that level of granularity. Maybe checking with @wdarko1 based on current customer requirements could be useful.

I feel like using the AKSNodeClass is the pattern that's already established. NodePools seem to be more about generic Karpenter scheduling specs and NodeClass about provider specific stuff. The NodePool also references the NodeClass so cx still has some level of granularity by swapping around the NodeClass that their NodePool points to. Will ask @wdarko1 for his thoughts.

Are the other features we have on AKSNodeClass also configurable per-agentpool in AKS? Or are they cluster-wide?

FIPS, LocalDNS, and Artifact Streaming all operate at nodepool level in AKS. In particular, FIPS seems the most similar, since you can enable it at cluster creation time for the system pool but requires opt-in for subsequent nodepools.

Should we block disablement after enablement to align with current AKS behavior?

AKS RP does reject attempts to change the UltraSSD settings on an existing pool today. In theory, if someone tries to edit the field on the NodeClass CR, we should consider the node as drifted and replace it.

Should we only allow ultra SSD to be enabled in Karpenter if it's enabled on the cluster? Basically introducing a validation here based on cluster state.

I think No. Today in AKS you can make a cluster without enabling ultra ssd at creation and add an ultra-ssd enabled nodepool to it.

PabloTriv · 2026-06-12T22:34:56Z

I think that the proposed implementation would be pretty consistent with AKS. If you disable the field on the NodeClass, all the NodeClaims belonging to that NodePool/Class would be considered drifted and replaced, the same way you'd have to replace a nodepool in AKS.

Interestingly, while AKS nodepools do not let you update the ultra ssd field, standalone VMs can actually enable/disable it, but require being stopped and deallocated first.
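A minimal sketch of the drift comparison described above, assuming the Ultra SSD setting a node was provisioned with is recorded somewhere Karpenter can read it back. The types `nodeClassSpec` and `nodeClaimState` and the function `isUltraSSDDrifted` are hypothetical names for illustration, not part of this PR.

```go
package main

import "fmt"

// nodeClassSpec is a hypothetical stand-in for the relevant part of AKSNodeClass.
type nodeClassSpec struct {
	UltraSSDEnabled *bool // nil means "not set" and is treated as disabled
}

// nodeClaimState is a hypothetical record of what a node was provisioned with.
type nodeClaimState struct {
	Name            string
	UltraSSDEnabled bool
}

// ultraSSDWanted resolves the optional field to a concrete value.
func ultraSSDWanted(spec nodeClassSpec) bool {
	return spec.UltraSSDEnabled != nil && *spec.UltraSSDEnabled
}

// isUltraSSDDrifted reports whether the node no longer matches the desired setting.
func isUltraSSDDrifted(spec nodeClassSpec, claim nodeClaimState) bool {
	return ultraSSDWanted(spec) != claim.UltraSSDEnabled
}

func main() {
	enabled := true
	spec := nodeClassSpec{UltraSSDEnabled: &enabled}
	claims := []nodeClaimState{
		{Name: "node-a", UltraSSDEnabled: true},
		{Name: "node-b", UltraSSDEnabled: false},
	}
	for _, c := range claims {
		fmt.Printf("%s drifted=%v\n", c.Name, isUltraSSDDrifted(spec, c))
	}
}
```

In this sketch a node is replaced in either direction (enabled to disabled, or disabled to enabled), which is the Karpenter analogue of having to recreate a nodepool in AKS.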

rakechill

design lgtm, let's just add a customer experience section to the design doc that makes the enablement/disablement conditions clear + adds support/reasoning from the docs + current AKS behavior.

comtalyst · 2026-06-16T00:02:48Z

+
+Reasons:
+
+- it is a provisioning feature, not just a schedulable label,


Suggested change

- it is a provisioning feature, not just a schedulable label,

- it is a provisioning feature, not a schedulable label,

comtalyst · 2026-06-16T00:05:46Z

+
+#### Offerings Filtering
+
+Ultra SSD is only available in regions and zones that support it, and only by specific SKUs. Therefore, we need to check availability for each zone when creating Offerings for InstanceTypes.


How does the AKS API validate UltraSSD enablement?
Worth referencing that here and "proving" that Karpenter complies with it, especially if that is determined by how Karpenter selects SKUs.
This is so that we can avoid provisioning-time failures or accidentally having to support what AKS doesn't.

I added a reference to this doc: https://learn.microsoft.com/en-us/azure/aks/use-ultra-disks. AKS rejects cluster or nodepool creation requests that pass --enable-ultra-ssd when the requested SKUs don't have UltraSSD available in the specified regions.

comtalyst · 2026-06-16T00:48:11Z

+func configureUltraSSDEnabled(nodeClass *v1beta1.AKSNodeClass) *bool {
+	if nodeClass == nil || nodeClass.IsUltraSSDEnabled() == false {
+		return nil
+	}
+	return lo.ToPtr(nodeClass.IsUltraSSDEnabled())
+}
+


Given this field affects instance filtering owned by Karpenter, Karpenter wants to own it and give Machine API a clear true or false.

Karpenter wouldn't want to give Machine API an unclear nil if Karpenter already assumes that it is false through instance filtering.

Let's also make this point of defaulting in each layer clear in the design doc.

Updated it for machine, which mirrors AKS behavior. Today in AKS though, vmss where UltraSSD == false leave the property as nil, so I'm thinking we do the same for the bootstrappingclient path.
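A small sketch of the per-layer defaulting discussed in this thread, assuming two provisioning paths: one that always sends an explicit value and one that leaves the property unset when Ultra SSD is disabled. The function names `machineUltraSSD` and `vmssUltraSSD` are hypothetical and do not come from the PR.

```go
package main

import "fmt"

// machineUltraSSD returns an explicit true/false for the (hypothetical)
// Machine API path, so the downstream API never has to guess a default.
func machineUltraSSD(enabled bool) *bool {
	v := enabled
	return &v
}

// vmssUltraSSD returns nil when disabled, mirroring the AKS VMSS behavior
// described in the reply above, and a pointer to true otherwise.
func vmssUltraSSD(enabled bool) *bool {
	if !enabled {
		return nil
	}
	v := true
	return &v
}

// describe renders an optional bool for printing.
func describe(p *bool) string {
	if p == nil {
		return "unset"
	}
	return fmt.Sprint(*p)
}

func main() {
	for _, enabled := range []bool{true, false} {
		fmt.Printf("enabled=%v machine=%s vmss=%s\n",
			enabled, describe(machineUltraSSD(enabled)), describe(vmssUltraSSD(enabled)))
	}
}
```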

comtalyst · 2026-06-16T00:57:10Z

+
+## Overview
+
+AKS supports Azure Ultra Disks by enabling Ultra SSD on the cluster or on a node pool at creation time with `--enable-ultra-ssd`. Nodes created from that cluster or node pool can then attach Persistent Volumes backed by the `UltraSSD_LRS` storage class.


Worth clarifying in the design doc the differences between the cluster-level and pool-level APIs in AKS, and how Karpenter support here builds on top of that.
E.g., if it is enabled at the cluster level, what happens to the pools that don't enable it, and vice versa? If it really only applies to the system pool and likely doesn't affect anything in the Karpenter layer, that is still worth a note.

Added a note to the design doc in the cx experience section

wdarko1 · 2026-06-16T17:42:31Z

Is there any way to make this per-nodepool? I expect many clusters will have different Nodepools for different use cases + may prefer that level of granularity. Maybe checking with @wdarko1 based on current customer requirements could be useful.

I feel like using the AKSNodeClass is the pattern that's already established. NodePools seem to be more about generic Karpenter scheduling specs and NodeClass about provider specific stuff. The NodePool also references the NodeClass so cx still has some level of granularity by swapping around the NodeClass that their NodePool points to. Will ask @wdarko1 for his thoughts.

Are the other features we have on AKSNodeClass also configurable per-agentpool in AKS? Or are they cluster-wide?

FIPS, LocalDNS, and Artifact Streaming all operate at nodepool level in AKS. In particular, FIPS seems the most similar, since you can enable it at cluster creation time for the system pool but requires opt-in for subsequent nodepools.

Should we block disablement after enablement to align with current AKS behavior?

AKS RP does reject attempts to change the UltraSSD settings on an existing pool today. In theory, if someone tries to edit the field on the NodeClass CR, we should consider the node as drifted and replace it.

Should we only allow ultra SSD to be enabled in Karpenter if it's enabled on the cluster? Basically introducing a validation here based on cluster state.

I think No. Today in AKS you can make a cluster without enabling ultra ssd at creation and add an ultra-ssd enabled nodepool to it.

Agree that customers should be able to select UltraSSD at the nodepool level. Customers can use multiple AKSNodeClasses IIRC, so they can just point their nodepools to a specific AKSNodeClass when needed and still achieve node/nodepool-level UltraSSD enablement. @PabloTriv @rakechill

matthchr · 2026-06-24T17:13:16Z

 	LinuxOSConfig *LinuxOSConfiguration `json:"linuxOSConfig,omitempty"`
+	// ultraSSD enables Ultra SSD for the provisioned nodes.
+	// +optional
+	UltraSSD *UltraSSD `json:"ultraSSD,omitempty"`


We didn't put OSDiskSizeGB there, but I wonder if we want a Disk or Storage section here so we can group storage-related capabilities the same way we do for GPU/Linux/etc., rather than UltraSSD, which is (probably) too narrowly scoped to ever get new fields:

    storage:
      enableUltraSSD:

or something?
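A rough sketch of what the grouped shape suggested above could look like as Go types. `StorageConfiguration`, `nodeClassSpec`, and `renderSpec` are hypothetical names used only to show the resulting field layout; they are not the types in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StorageConfiguration is a hypothetical grouping for storage-related settings.
type StorageConfiguration struct {
	// EnableUltraSSD toggles Ultra SSD support for provisioned nodes.
	EnableUltraSSD *bool `json:"enableUltraSSD,omitempty"`
}

// nodeClassSpec is a trimmed-down, hypothetical stand-in for the AKSNodeClass spec.
type nodeClassSpec struct {
	Storage *StorageConfiguration `json:"storage,omitempty"`
}

// renderSpec serializes the spec to JSON so the field shape is visible.
func renderSpec(spec nodeClassSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	enabled := true
	out, err := renderSpec(nodeClassSpec{Storage: &StorageConfiguration{EnableUltraSSD: &enabled}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Grouping under a section like this leaves room for further storage fields later without adding more top-level entries to the spec.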

matthchr · 2026-06-24T17:17:04Z

 	offerings := []*cloudprovider.Offering{}
 	for zone := range offeringZones {
+		if params.UltraSSDEnabled {
+			if zone == "0" && !sku.IsUltraSSDAvailableWithoutAvailabilityZone() {


Do we have examples where ultrassd is enabled in some zones but not others? I assume this must happen given skewer supports it but might be good to understand what exactly that looks like.
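To make the per-zone filtering concrete, here is a self-contained sketch under simplifying assumptions. The `skuInfo` type and `eligibleZones` function are hypothetical; the real code relies on SKU capability data, and the capability values below are made up for illustration. Zone "0" is taken to mean "no availability zone", following the hunk above.

```go
package main

import "fmt"

// skuInfo is a hypothetical, simplified view of a VM SKU's Ultra SSD capabilities.
type skuInfo struct {
	Name                string
	UltraSSDZones       map[string]bool // zones where Ultra SSD is available
	UltraSSDWithoutZone bool            // availability for non-zonal deployments
}

// eligibleZones returns the zones that can still be offered for the SKU.
// When Ultra SSD is not requested, every zone is kept.
func eligibleZones(sku skuInfo, zones []string, ultraSSDEnabled bool) []string {
	out := []string{}
	for _, zone := range zones {
		if ultraSSDEnabled {
			if zone == "0" && !sku.UltraSSDWithoutZone {
				continue // non-zonal deployment without Ultra SSD support
			}
			if zone != "0" && !sku.UltraSSDZones[zone] {
				continue // zonal deployment in a zone lacking Ultra SSD
			}
		}
		out = append(out, zone)
	}
	return out
}

func main() {
	// Illustrative data only: Ultra SSD available in zones 1 and 3, not in 2.
	sku := skuInfo{
		Name:          "example-sku",
		UltraSSDZones: map[string]bool{"1": true, "3": true},
	}
	fmt.Println(eligibleZones(sku, []string{"1", "2", "3"}, true))
	fmt.Println(eligibleZones(sku, []string{"1", "2", "3"}, false))
}
```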

matthchr · 2026-06-24T17:20:08Z

+
+### Decision 1: Where should Ultra SSD be configured?
+
+#### Add a strongly typed field to `AKSNodeClass`


Did you consider a label?

scheduling.NewRequirement(v1beta1.LabelSKUStoragePremiumCapable, corev1.NodeSelectorOpIn, fmt.Sprint(sku.IsPremiumIO())),
scheduling.NewRequirement(v1beta1.LabelSKUAcceleratedNetworking, corev1.NodeSelectorOpIn, fmt.Sprint(sku.IsAcceleratedNetworkingSupported())),

are labels, and they feel very similar to ultraSSD to me. I also think a label would make the implementation a bit cleaner, because computeRequirements already takes params *instanceTypeParameters and the SKU, so we could do all of the filtering for ultraSSD + zones there.

It also means that a workload could ask for ultraSSD on the workload, rather than having to have a totally separate NodePool.

So I am wondering why not a well known label rather than a field on AKSNodeClass.
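The label-based alternative can be sketched without the Karpenter scheduling package. This assumes a hypothetical label key (`example.com/sku-storage-ultra-capable`) and simplified `sku` type; the actual well-known label name and requirement plumbing are not shown in this thread.

```go
package main

import (
	"fmt"
	"strconv"
)

// labelUltraSSDCapable is a hypothetical label key used only for this sketch.
const labelUltraSSDCapable = "example.com/sku-storage-ultra-capable"

// sku is a hypothetical, simplified SKU description.
type sku struct {
	Name            string
	UltraSSDCapable bool
}

// labelsFor derives node labels from SKU capabilities.
func labelsFor(s sku) map[string]string {
	return map[string]string{labelUltraSSDCapable: strconv.FormatBool(s.UltraSSDCapable)}
}

// satisfies reports whether every selector key is present in labels with the same value.
func satisfies(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// compatibleSKUs returns the names of SKUs whose labels satisfy the selector.
func compatibleSKUs(selector map[string]string, skus []sku) []string {
	out := []string{}
	for _, s := range skus {
		if satisfies(selector, labelsFor(s)) {
			out = append(out, s.Name)
		}
	}
	return out
}

func main() {
	skus := []sku{
		{Name: "sku-with-ultra", UltraSSDCapable: true},
		{Name: "sku-without-ultra", UltraSSDCapable: false},
	}
	// A workload asking for Ultra SSD via a node selector.
	selector := map[string]string{labelUltraSSDCapable: "true"}
	fmt.Println(compatibleSKUs(selector, skus))
}
```

With this shape, a workload can request Ultra SSD through its own selector, and SKU filtering falls out of ordinary label matching instead of a separate NodePool.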

Pablo Trivino added 6 commits June 1, 2026 12:39

design doc

ead5d7d

define in struct

e54289f

generate aksnodeclass edits

02645f4

filters

2361e64

add ultra ssd

2b7cf78

remove link

0fe9f2f

rakechill reviewed Jun 10, 2026

rakechill reviewed Jun 11, 2026

rakechill reviewed Jun 15, 2026

Pablo Trivino added 5 commits June 15, 2026 13:05

make ultra ssd setting consistent

d251af0

resolve conflict and update parameter use to be consistent

01cb5f7

reformat for consistency

f362359

add small test

6c920a2

Create e2e test file

e8b4112

comtalyst reviewed Jun 16, 2026

e2e skeleton

4f13615

Pablo Trivino and others added 8 commits June 16, 2026 13:55

cx experience section

79d2de7

disable

24eca5a

Merge branch 'main' into pablotrivino/enalbeUltraSSD

a5533e5

rm just

f85552c

Merge branch 'pablotrivino/enalbeUltraSSD' of https://github.com/Azure/karpenter-provider-azure into pablotrivino/enalbeUltraSSD

228fbd9

do not return nil

2b81077

nil check

5356a06

godoc

b0cb705

Pablo Trivino and others added 9 commits June 19, 2026 16:16

add ctx

c2f8e8d

proper filtering

31452b2

update comment and remove ctx parameter

a748ddf

undo focus changes

d92fbc6

undo changes to e2e yaml

7066b58

make ci nontest

3f0f0ec

second round of make ci-non-test

13f67ee

settings fix

7b7d726

add nl

7dab9c5

matthchr reviewed Jun 24, 2026

Pablo Trivino and others added 20 commits June 26, 2026 16:27

change to label

636241c

label impl

19bc586

pass the label downstream as an argument to enable ultra ssd

e9044a6

rm ultrassd

f3a537b

generate diff

33205fd

better equalfold

01033e8

remove a bunch of nodeclass stuff

5d1c4d2

rm params

a5b0109

register

2330c3e

add focus

1c6af9f

update test

d0034c4

add in labels

30a02f5

switch to label

8146ba9

generate yamls

abab37a

make verify

101e974

update labels

a9ae1dd

make verify

da0a33a

edit test suite to expect well known label

eb1bb9c

undo change

f204f43

add another test

1cd09aa

