feat: add managed NVIDIA GPU experience to AKSNodeClass#1752
Draft
MaximilianoUribe wants to merge 3 commits into
Add gpu.nvidia.managementMode (Managed|Unmanaged) to the v1beta1 AKSNodeClass and forward it to the AKS machine API GPU profile, allowing GPU NodePools to opt into the managed NVIDIA GPU experience (AKS installs and manages the NVIDIA device plugin and DCGM metrics on top of the GPU driver) instead of managing those components themselves. The field is optional and defaults to Unmanaged, so existing GPU NodeClasses are unaffected (no hash/version change). Managed requires driver installation (gpu.mode != None) and applies only to NVIDIA GPU SKUs; both are enforced via a CEL rule and a defensive guard in the machine builder.
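The behaviour described above can be pictured with a minimal, self-contained sketch. The names `NvidiaGPU`, `ManagementMode`, `GetManagementMode()` and `IsManagedGPUEnabled()` come from this PR; everything else is an assumption made for illustration, including the simplified `GPU` struct, the receiver type of the helpers, the hypothetical `nvidiaManagementMode` function standing in for the machine-builder guard, and the `isNvidiaSKU` flag. It is not the code in the diff.

```go
package main

import "fmt"

// ManagementMode mirrors the enum named in the PR; the exact Go shape is assumed.
type ManagementMode string

const (
	ManagementModeManaged   ManagementMode = "Managed"
	ManagementModeUnmanaged ManagementMode = "Unmanaged"
)

// NvidiaGPU and GPU are simplified stand-ins for the v1beta1 types.
type NvidiaGPU struct {
	ManagementMode *ManagementMode
}

type GPU struct {
	Mode   *string // "Driver" or "None"; nil is treated as driver installation here (assumption)
	Nvidia *NvidiaGPU
}

// GetManagementMode is nil-safe and falls back to Unmanaged when nothing is set.
func (g *GPU) GetManagementMode() ManagementMode {
	if g == nil || g.Nvidia == nil || g.Nvidia.ManagementMode == nil {
		return ManagementModeUnmanaged
	}
	return *g.Nvidia.ManagementMode
}

// IsManagedGPUEnabled reports whether the managed experience was requested.
func (g *GPU) IsManagedGPUEnabled() bool {
	return g.GetManagementMode() == ManagementModeManaged
}

// nvidiaManagementMode (hypothetical) returns the value to forward on the GPU
// profile, or nil to leave the Nvidia block unset. It applies both guards:
// NVIDIA SKU only, and driver installation must not be disabled.
func nvidiaManagementMode(g *GPU, isNvidiaSKU bool) *ManagementMode {
	driverInstall := g == nil || g.Mode == nil || *g.Mode != "None"
	if !isNvidiaSKU || !driverInstall || !g.IsManagedGPUEnabled() {
		return nil
	}
	m := ManagementModeManaged
	return &m
}

func main() {
	managed := ManagementModeManaged
	none := "None"
	g := &GPU{Nvidia: &NvidiaGPU{ManagementMode: &managed}}

	fmt.Println(nvidiaManagementMode(g, true) != nil)   // NVIDIA SKU with driver install: forwarded
	fmt.Println(nvidiaManagementMode(g, false) != nil)  // non-NVIDIA SKU: left unset
	g.Mode = &none
	fmt.Println(nvidiaManagementMode(g, true) != nil)   // mode None: left unset
	fmt.Println(nvidiaManagementMode(nil, true) != nil) // no gpu block: left unset
}
```

Returning nil for the default `Unmanaged` case, rather than emitting an explicit value, is what keeps the request identical to today's for existing NodeClasses in this sketch.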
Fixes #
Description
Adds an opt-in managed NVIDIA GPU experience to `AKSNodeClass`. A new optional field `gpu.nvidia.managementMode` (`Managed` | `Unmanaged`) lets a GPU NodePool request that AKS install and manage the NVIDIA stack on top of the GPU driver (the NVIDIA device plugin and DCGM metrics exporter) instead of the user managing those components themselves. When provisioning via the AKS machine API, the field is forwarded on the GPU profile and the managed components are provided by the AKS control plane.

Changes included:
- Adds `gpu.nvidia.managementMode` to the v1beta1 `AKSNodeClass` (new `NvidiaGPU` type and `ManagementMode` enum), plus nil-safe helpers `GetManagementMode()` / `IsManagedGPUEnabled()`.
- Sets `GPUProfile.Nvidia.ManagementMode` when building the AKS machine. The setting is only emitted for NVIDIA GPU SKUs with driver installation enabled; otherwise `Nvidia` is left unset.
- Adds a CEL rule that rejects `managementMode: Managed` together with `gpu.mode: None`, since the managed experience requires driver installation.

Backward compatibility:
- The field is optional and defaults to `Unmanaged`. Omitting it produces the exact same request as today, and a nil value does not change the `AKSNodeClass` hash, so existing GPU NodeClasses do not drift. No hash-version bump is required.

Scope:
- MIG configuration (`migStrategy` / `gpuInstanceProfile`) is intentionally left as a follow-up; it can be added additively under the same `gpu.nvidia` block without breaking this change.

How was this change tested?
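One of the properties checked is that an unset `nvidia` block leaves the hash unchanged while setting `managementMode` changes it. The sketch below illustrates why an optional pointer field can have that property. It is a stand-in only: it hashes an `omitempty` JSON encoding with SHA-256 from the Go standard library, which is not necessarily how the project computes the `AKSNodeClass` hash, and the `specBefore` / `specAfter` types are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// nvidiaGPU is a hypothetical stand-in for the new optional block.
type nvidiaGPU struct {
	ManagementMode string `json:"managementMode,omitempty"`
}

// specBefore models the hashed GPU settings before this change (assumed shape).
type specBefore struct {
	Mode string `json:"mode,omitempty"`
}

// specAfter adds the optional block as a pointer so nil can be omitted entirely.
type specAfter struct {
	Mode   string     `json:"mode,omitempty"`
	Nvidia *nvidiaGPU `json:"nvidia,omitempty"`
}

// hashOf hashes the JSON encoding of v; a stdlib substitute for the real hasher.
func hashOf(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

func main() {
	old := hashOf(specBefore{Mode: "Driver"})
	unset := hashOf(specAfter{Mode: "Driver"})
	managed := hashOf(specAfter{Mode: "Driver", Nvidia: &nvidiaGPU{ManagementMode: "Managed"}})

	fmt.Println(old == unset)   // true: a nil block is omitted, so the encoded bytes match
	fmt.Println(old == managed) // false: setting managementMode changes the encoded bytes
}
```

The same reasoning applies to any hasher that skips nil or zero-valued fields: a field that contributes nothing when unset cannot cause drift on existing objects.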
Unit/package validation:
`go test ./pkg/apis/v1beta1 ./pkg/apis/v1alpha2 ./pkg/providers/instance`
`golangci-lint-custom run ./pkg/apis/... ./pkg/providers/instance/...`

Added coverage:
- `configureGPUProfile`: unset → no `Nvidia`; `Unmanaged` → no `Nvidia`; `Managed` on an NVIDIA SKU → `Nvidia.ManagementMode=Managed`; `Managed` on an AMD SKU → no `Nvidia` (NVIDIA-only guard); `Managed` with `mode: None` → no `Nvidia` (driver-install guard).
- CEL validation: accepts `Managed` (default/explicit `Driver` mode) and `Unmanaged` (any mode); rejects `Managed` + `mode: None`; rejects invalid enum values.
- Hash: setting `managementMode` changes the hash; a nil `nvidia` leaves the hash unchanged (non-breaking).
- Nil-safe helpers `GetManagementMode()` / `IsManagedGPUEnabled()`.

Does this change impact docs?
Release Note