feat: support for ultrassd #1704
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,170 @@ | ||
| # Ultra SSD Support for NAP | ||
|
|
||
| **Author:** @pablotrivino | ||
|
|
||
| **Last updated:** June 1, 2026 | ||
|
|
||
| **Status:** Proposed | ||
|
|
||
| ## Overview | ||
|
|
||
| AKS supports Azure Ultra Disks by enabling Ultra SSD on the cluster or on a node pool at creation time with `--enable-ultra-ssd`. Nodes created from that cluster or node pool can then attach Persistent Volumes backed by the `UltraSSD_LRS` storage class. | ||
|
|
||
| Today in AKS, `--enable-ultra-ssd` ultimately enables `AdditionalCapabilities.UltraSSDEnabled = true` on the underlying VM or VMSS model. That does not automatically add labels, taints, or tolerations for scheduling. It only makes the node capable of attaching Ultra SSDs for workloads that use an UltraSSD-backed PV. Placement policy remains the user's responsibility. | ||
|
|
||
| For Node Auto Provisioning (NAP), we need the equivalent behavior on dynamically created capacity. This means Karpenter must be able to: | ||
|
|
||
| - express Ultra SSD as part of node configuration, | ||
| - filter out VM sizes and zonal offerings that do not support Ultra SSD, | ||
| - set the correct downstream API fields when creating capacity. | ||
|
|
||
| This document proposes how to complete that work for NAP. | ||
|
|
||
| ### Goals | ||
|
|
||
| - Add support for enabling Ultra SSD on dynamically provisioned nodes. | ||
| - Support both VM provisioning mode and AKS Machine API mode. | ||
|
|
||
| - Filter instance types and offerings to only Ultra SSD-capable SKU plus zone combinations when the feature is enabled. | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| - Adding provider-managed scheduling controls beyond offerings filtering, such as automatic Requirements, labels, taints, or tolerations. | ||
| - Automatically steering Ultra SSD workloads onto Ultra SSD-capable nodes. | ||
|
|
||
| ## Decisions | ||
|
|
||
| ### Decision 1: Where should Ultra SSD be configured? | ||
|
|
||
| #### Add a strongly typed field to `AKSNodeClass` | ||
|
Member
Did you consider a label? Are labels, and feel very similar to me to ultraSSD. I also think doing a label would make the implementation a bit cleaner (because It also means that a workload could ask for ultraSSD on the workload, rather than having to have a totally separate NodePool. So I am wondering why not a well known label rather than a field on AKSNodeClass. |
||
|
|
||
| Proposed shape: | ||
|
|
||
| ```yaml | ||
| apiVersion: karpenter.azure.com/v1beta1 | ||
| kind: AKSNodeClass | ||
| spec: | ||
| ultraSSD: | ||
| enabled: true | ||
| ``` | ||
|
|
||
| Suggested Go shape: | ||
|
|
||
| ```go | ||
| type UltraSSD struct { | ||
| Enabled *bool `json:"enabled,omitempty"` | ||
| } | ||
|
|
||
| type AKSNodeClassSpec struct { | ||
| // ... existing fields ... | ||
| UltraSSD *UltraSSD `json:"ultraSSD,omitempty"` | ||
| } | ||
|
|
||
| func (in *AKSNodeClass) IsUltraSSDEnabled() bool { | ||
| return in.Spec.UltraSSD != nil && | ||
| in.Spec.UltraSSD.Enabled != nil && | ||
| *in.Spec.UltraSSD.Enabled | ||
| } | ||
| ``` | ||
|
|
||
| This matches the existing API style for feature toggles such as `artifactStreaming`, `security.encryptionAtHost`, and `localDNS`. | ||
| Ultra SSD should be configured as a strongly typed `AKSNodeClass` feature, not as a raw requirement. | ||
|
|
||
|
|
||
| Reasons: | ||
|
|
||
| - it is a provisioning feature, not a schedulable label, | ||
| - it aligns with the current `AKSNodeClass` design pattern. | ||
|
|
||
| The user expectation of “default false” is still satisfied. If `spec.ultraSSD` or `spec.ultraSSD.enabled` is omitted, the effective value is disabled. | ||
|
|
||
| ### Decision 2: How should we filter for compatible Instances? | ||
|
|
||
| #### Offerings Filtering | ||
|
|
||
| Ultra SSD is only available in regions and zones that support it, and only on specific SKUs. Therefore, we need to check availability for each zone when creating Offerings for InstanceTypes. | ||
|
Collaborator
How does the AKS API validate UltraSSD enablement?
Collaborator
Author
I added a reference to this doc https://learn.microsoft.com/en-us/azure/aks/use-ultra-disks. AKS will reject requests for cluster or nodepool creations that have |
||
|
|
||
| ### Decision 3: Should the provider add labels, requirements, taints, or tolerations? | ||
|
|
||
| #### No provider-managed scheduling projection | ||
|
|
||
| We will not add Ultra SSD-specific Requirements, Labels, Taints, or Tolerations from the provider. | ||
|
|
||
| Rationale: | ||
|
|
||
| - this matches current AKS behavior, where `--enable-ultra-ssd` enables attachment capability but does not impose placement policy, | ||
| - the primary job of this feature is to make the node capable of attaching Ultra SSDs, not to decide which workloads should land on it, | ||
| - users who want explicit scheduling separation can model that themselves in the `NodePool` using labels, taints, tolerations, or affinity. | ||
|
|
||
| ### Conclusion | ||
|
|
||
| The implementation should follow the established provider pattern: | ||
|
|
||
| 1. strongly typed `AKSNodeClass` feature, | ||
| 2. helper accessor like `IsUltraSSDEnabled()`, | ||
| 3. instance type and offering filtering, | ||
| 4. downstream API wiring in both provisioning modes. | ||
|
|
||
| ## Proposed Implementation | ||
|
|
||
| ### API changes | ||
|
|
||
| Add a new field to `AKSNodeClass`: | ||
|
|
||
| ```yaml | ||
| spec: | ||
| ultraSSD: | ||
| enabled: true | ||
| ``` | ||
|
|
||
| Semantics: | ||
|
|
||
| - default is disabled when omitted, | ||
| - enabling it opts the node class into Ultra SSD-capable capacity only, | ||
| - changing it triggers node replacement through drift. | ||
|
|
||
|
|
||
| ### Filtering | ||
|
|
||
| Filter out InstanceTypes that don't support UltraSSD when it is enabled. | ||
|
|
||
| - UltraSSD support is also region and zone dependent, so filtering must also happen at the Offering level. | ||
| - Add a check during `createOfferings` to verify that the zone + SKU combination supports UltraSSD. | ||
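The offering-level check described above could look roughly like the following. This is a sketch under stated assumptions: `Offering`, `ultraSSDZones`, and `filterOfferings` are hypothetical names, the zone strings are illustrative, and the capability data would in practice come from Azure SKU information rather than a hard-coded map.

```go
package main

import "fmt"

// Offering is a simplified, hypothetical stand-in for a Karpenter offering:
// one (zone, capacity type) combination of an instance type.
type Offering struct {
	Zone         string
	CapacityType string
}

// ultraSSDZones maps a SKU name to the zones where that SKU can attach
// Ultra Disks. The contents used below are illustrative, not real data.
type ultraSSDZones map[string]map[string]bool

// supports reports whether the SKU supports Ultra SSD in the given zone.
// Unknown SKUs and unknown zones both yield false.
func (u ultraSSDZones) supports(sku, zone string) bool {
	return u[sku][zone]
}

// filterOfferings returns the offerings that remain valid for a SKU.
// When Ultra SSD is disabled, every offering is kept unchanged.
func filterOfferings(sku string, offerings []Offering, ultraSSDEnabled bool, caps ultraSSDZones) []Offering {
	if !ultraSSDEnabled {
		return offerings
	}
	kept := make([]Offering, 0, len(offerings))
	for _, o := range offerings {
		if caps.supports(sku, o.Zone) {
			kept = append(kept, o)
		}
	}
	return kept
}

func main() {
	caps := ultraSSDZones{
		"Standard_D4s_v5": {"eastus-1": true}, // illustrative capability data
	}
	offerings := []Offering{
		{Zone: "eastus-1", CapacityType: "on-demand"},
		{Zone: "eastus-2", CapacityType: "on-demand"},
	}
	for _, o := range filterOfferings("Standard_D4s_v5", offerings, true, caps) {
		fmt.Println("kept offering in zone", o.Zone)
	}
}
```

An instance type whose offerings are all filtered out ends up with an empty list, which is how the InstanceType-level filtering falls out of the Offering-level check in this sketch.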
|
|
||
| ### Scheduling behavior | ||
|
|
||
| The provider will not add Ultra SSD-specific Requirements, Labels, Taints, or Tolerations. | ||
|
|
||
| If users want workloads that use UltraSSD-backed PVs to land only on Ultra SSD-capable nodes, they must model that in their own `NodePool` and workload configuration. | ||
|
|
||
| Examples of user-managed policy include: | ||
|
|
||
| - adding labels to the `NodePool` template, | ||
| - adding taints to the `NodePool`, | ||
| - adding tolerations and affinity to workloads. | ||
|
|
||
| ### VM mode wiring | ||
|
|
||
| #### VM | ||
| Update VM creation so Ultra SSD-enabled node classes set `vm.Properties.AdditionalCapabilities.UltraSSDEnabled = true`. This is left nil if UltraSSD is not enabled, which is consistent with AKS. | ||
|
|
||
| #### Machine | ||
| Set `armcontainerservice.Machine.Properties.MachineProperties.MachineHardwareProfile.UltraSsdEnabled = true` if enabled and `false` if disabled. This is consistent with AKS. | ||
|
|
||
| This mirrors the current AKS behavior behind `--enable-ultra-ssd`: the node is made capable of attaching Ultra SSDs, but scheduling policy is left to the user. | ||
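A minimal sketch of the two wiring rules above. It uses local stand-in structs instead of the real Azure SDK types so that it is self-contained; the struct layouts and the helper names `applyUltraSSDToVM` and `applyUltraSSDToMachine` are assumptions for illustration only.

```go
package main

import "fmt"

// Local stand-ins for the SDK shapes named in the text. Real code would use
// the corresponding Azure SDK types instead.
type AdditionalCapabilities struct {
	UltraSSDEnabled *bool
}

type VMProperties struct {
	AdditionalCapabilities *AdditionalCapabilities
}

type MachineHardwareProfile struct {
	UltraSsdEnabled *bool
}

// applyUltraSSDToVM sets UltraSSDEnabled = true only when the feature is
// enabled. When disabled, AdditionalCapabilities is left untouched (nil here).
func applyUltraSSDToVM(props *VMProperties, enabled bool) {
	if !enabled {
		return
	}
	if props.AdditionalCapabilities == nil {
		props.AdditionalCapabilities = &AdditionalCapabilities{}
	}
	v := true
	props.AdditionalCapabilities.UltraSSDEnabled = &v
}

// applyUltraSSDToMachine always sets an explicit value: true when enabled,
// false when disabled.
func applyUltraSSDToMachine(hw *MachineHardwareProfile, enabled bool) {
	hw.UltraSsdEnabled = &enabled
}

func main() {
	vm := &VMProperties{}
	applyUltraSSDToVM(vm, false)
	fmt.Println("VM capabilities set when disabled:", vm.AdditionalCapabilities != nil)

	applyUltraSSDToVM(vm, true)
	fmt.Println("VM UltraSSDEnabled when enabled:", *vm.AdditionalCapabilities.UltraSSDEnabled)

	hw := &MachineHardwareProfile{}
	applyUltraSSDToMachine(hw, false)
	fmt.Println("Machine UltraSsdEnabled when disabled:", *hw.UltraSsdEnabled)
}
```

The asymmetry is deliberate in this sketch: the VM path omits the field when disabled, while the Machine path sends an explicit `false`, matching the behavior described for each mode.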
|
|
||
| ### AKS Machine API wiring | ||
|
|
||
| Update AKS machine template creation so Ultra SSD-enabled node classes set `aksMachine.Properties.EnableUltraSSD = true`. | ||
|
|
||
| ### Customer Experience and AKS Parity | ||
|
|
||
| Customers wishing to use UltraSSD will set `spec.ultraSSD.enabled` to `true` on their AKSNodeClass CR. This field will be used to restrict offerings to the SKUs and zones that support it (i.e., making sure that the SKU supports UltraSSD in the given zones). | ||
|
|
||
| In AKS, creating a cluster with `--enable-ultra-ssd` means the initial system pool gets UltraSSD capabilities. Additional pools must also explicitly include the `--enable-ultra-ssd` flag at creation time to enable it. AKS validates the cluster or pool request and rejects it if the user did not specify zones, or if the SKU does not support UltraSSD in any of the zones. All nodes belonging to a pool created with the flag are UltraSSD capable. Clusters can have any mix of UltraSSD-enabled and disabled pools, regardless of whether the cluster was initially created with `--enable-ultra-ssd`. | ||
|
|
||
| For NAP parity, enabling the feature in an AKSNodeClass means Karpenter will only consider offerings whose zone has UltraSSD available for the given SKU, and it will automatically set those nodes to support UltraSSD. If a customer disables the feature in the AKSNodeClass CR, then the nodes will be considered drifted and re-created with the UltraSSD support disabled. AKS does not add any kind of label, annotation, or taint to the nodes saying UltraSSD is enabled, so NAP doesn't either. | ||
|
|
||
| See References section for more information on what AKS does. | ||
|
|
||
| ## References | ||
|
|
||
| - AKS Ultra Disks documentation: https://learn.microsoft.com/en-us/azure/aks/use-ultra-disks | ||
|
|
||
Worth clarifying in the design doc the differences between the cluster-level and pool-level APIs in AKS, and how Karpenter support here builds on top of that.
E.g., if it is enabled at cluster-level, what will happen to the pools that don't enable it, vice-versa? If it is really just systempool and likely doesn't affect anything in Karpenter layer, then still worth a note.
Added a note to the design doc in the Customer Experience section.