feat: add managed NVIDIA GPU experience to AKSNodeClass#1752

Draft
MaximilianoUribe wants to merge 3 commits into main from muribefalcon/managed-gpu-nap

@MaximilianoUribe

Collaborator

Fixes #

Description

Adds an opt-in managed NVIDIA GPU experience to AKSNodeClass. A new optional field, gpu.nvidia.managementMode (Managed | Unmanaged), lets a GPU NodePool request that AKS install and manage the NVIDIA stack that sits on top of the GPU driver (the NVIDIA device plugin and the DCGM metrics exporter), so the user does not have to manage those components. When provisioning via the AKS machine API, the field is forwarded on the GPU profile and the managed components are provided by the AKS control plane.

Changes included:

  • Adds gpu.nvidia.managementMode to the v1beta1 AKSNodeClass (new NvidiaGPU type and ManagementMode enum), plus nil-safe helpers GetManagementMode() / IsManagedGPUEnabled().
  • Forwards the mode to GPUProfile.Nvidia.ManagementMode when building the AKS machine. The setting is only emitted for NVIDIA GPU SKUs with driver installation enabled; otherwise Nvidia is left unset.
  • Adds a CEL validation rule rejecting managementMode: Managed together with gpu.mode: None, since the managed experience requires driver installation.
  • Regenerates the CRD and deepcopy.
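
For reviewers skimming the description, the API shape and the nil-safe helpers listed above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the GPU struct, the GPUMode type, the receiver the helpers hang off, and the validate function (a Go mirror of what the CEL rule is described as rejecting) are all assumptions made for the example. Only the names NvidiaGPU, ManagementMode, GetManagementMode() and IsManagedGPUEnabled() come from the description.

```go
package main

import "fmt"

// ManagementMode mirrors the enum named in the PR description.
type ManagementMode string

const (
	ManagementModeManaged   ManagementMode = "Managed"
	ManagementModeUnmanaged ManagementMode = "Unmanaged"
)

// GPUMode is a simplified stand-in for the existing gpu.mode field (assumed shape).
type GPUMode string

const (
	GPUModeDriver GPUMode = "Driver"
	GPUModeNone   GPUMode = "None"
)

// NvidiaGPU holds the NVIDIA-specific settings under gpu.nvidia.
type NvidiaGPU struct {
	ManagementMode *ManagementMode `json:"managementMode,omitempty"`
}

// GPU is a hypothetical, trimmed-down version of the gpu block.
type GPU struct {
	Mode   *GPUMode   `json:"mode,omitempty"`
	Nvidia *NvidiaGPU `json:"nvidia,omitempty"`
}

// GetManagementMode returns Unmanaged whenever any link in the chain is nil,
// so callers never need their own nil checks.
func (g *GPU) GetManagementMode() ManagementMode {
	if g == nil || g.Nvidia == nil || g.Nvidia.ManagementMode == nil {
		return ManagementModeUnmanaged
	}
	return *g.Nvidia.ManagementMode
}

// IsManagedGPUEnabled reports whether the managed experience was requested.
func (g *GPU) IsManagedGPUEnabled() bool {
	return g.GetManagementMode() == ManagementModeManaged
}

// validate mirrors, in plain Go, the combination the CEL rule is described
// as rejecting: Managed together with gpu.mode: None.
func (g *GPU) validate() error {
	if g.IsManagedGPUEnabled() && g.Mode != nil && *g.Mode == GPUModeNone {
		return fmt.Errorf("gpu.nvidia.managementMode %q requires driver installation; gpu.mode must not be %q",
			ManagementModeManaged, GPUModeNone)
	}
	return nil
}

func main() {
	var unset *GPU // a NodeClass with no gpu block at all
	fmt.Println(unset.GetManagementMode(), unset.IsManagedGPUEnabled())

	managed, none := ManagementModeManaged, GPUModeNone
	bad := &GPU{Mode: &none, Nvidia: &NvidiaGPU{ManagementMode: &managed}}
	fmt.Println(bad.validate())
}
```

Pointer fields with omitempty keep "unset" distinguishable from an explicit Unmanaged, which is the property the backward-compatibility notes rely on.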

Backward compatibility:

  • The field is optional and defaults to Unmanaged. Omitting it produces the exact same request as today, and a nil value does not change the AKSNodeClass hash, so existing GPU NodeClasses do not drift. No hash-version bump is required.
  • The field is added to v1beta1 only (the v1alpha2 schema is unchanged).
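
The no-drift claim above can be illustrated with a small sketch. It assumes the hash is derived from a serialization that omits nil optional fields; the JSON plus SHA-256 combination used here is only a stand-in for illustration and is not the provider's actual hashing code. The struct names gpuBefore and gpuAfter are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// gpuBefore models the gpu block before this change (hypothetical, trimmed).
type gpuBefore struct {
	Mode *string `json:"mode,omitempty"`
}

// nvidiaGPU models the new gpu.nvidia block.
type nvidiaGPU struct {
	ManagementMode *string `json:"managementMode,omitempty"`
}

// gpuAfter models the gpu block after this change: one new optional pointer field.
type gpuAfter struct {
	Mode   *string    `json:"mode,omitempty"`
	Nvidia *nvidiaGPU `json:"nvidia,omitempty"`
}

// hashOf is a stand-in hash: SHA-256 over the JSON encoding of v.
func hashOf(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	mode, managed := "Driver", "Managed"

	old := gpuBefore{Mode: &mode}
	upgraded := gpuAfter{Mode: &mode} // nvidia left nil, as for existing NodeClasses
	optedIn := gpuAfter{Mode: &mode, Nvidia: &nvidiaGPU{ManagementMode: &managed}}

	fmt.Println("unchanged for existing specs:", hashOf(old) == hashOf(upgraded))
	fmt.Println("changes when opting in:      ", hashOf(upgraded) != hashOf(optedIn))
}
```

Under that assumption, a nil nvidia block serializes exactly like the pre-change spec, so existing NodeClasses keep their hash, while opting in produces a different hash and therefore triggers drift as expected.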

Scope:

  • This PR covers the management-mode toggle only. Managed Multi-Instance GPU (MIG) configuration (migStrategy / gpuInstanceProfile) is intentionally left as a follow-up; it can be added additively under the same gpu.nvidia block without breaking this change.

How was this change tested?

Unit/package validation:

go test ./pkg/apis/v1beta1 ./pkg/apis/v1alpha2 ./pkg/providers/instance
golangci-lint-custom run ./pkg/apis/... ./pkg/providers/instance/...

Added coverage:

  • configureGPUProfile: unset → no Nvidia; Unmanaged → no Nvidia; Managed on an NVIDIA SKU → Nvidia.ManagementMode=Managed; Managed on an AMD SKU → no Nvidia (NVIDIA-only guard); Managed with mode: None → no Nvidia (driver-install guard).
  • CEL: accepts Managed (default/explicit Driver mode) and Unmanaged (any mode); rejects Managed + mode: None; rejects invalid enum values.
  • Hash: setting managementMode changes the hash; a nil nvidia leaves the hash unchanged (non-breaking).
  • Helpers: nil-safety for GetManagementMode() / IsManagedGPUEnabled().
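
The configureGPUProfile cases listed above map naturally onto a table. The sketch below restates them against a simplified stand-in: configureGPUProfileSketch, gpuRequest, and the vendor strings are hypothetical and do not reflect the real function's signature or the AKS SDK types. Only the expected outcomes are taken from the coverage list.

```go
package main

import "fmt"

// nvidiaProfile and gpuProfile are simplified stand-ins for the AKS machine
// API GPU profile types.
type nvidiaProfile struct{ ManagementMode string }

type gpuProfile struct{ Nvidia *nvidiaProfile }

// gpuRequest is a hypothetical flattened view of the builder's inputs.
type gpuRequest struct {
	vendor         string // "nvidia" or "amd" (illustrative values)
	driverInstall  bool   // false corresponds to gpu.mode: None
	managementMode string // "", "Managed" or "Unmanaged"
}

// configureGPUProfileSketch emits Nvidia only for Managed on an NVIDIA SKU
// with driver installation enabled; every other combination leaves it unset.
func configureGPUProfileSketch(r gpuRequest) gpuProfile {
	var p gpuProfile
	if r.managementMode == "Managed" && r.vendor == "nvidia" && r.driverInstall {
		p.Nvidia = &nvidiaProfile{ManagementMode: "Managed"}
	}
	return p
}

// cases restates the coverage list as a table.
var cases = []struct {
	name       string
	req        gpuRequest
	wantNvidia bool
}{
	{"unset", gpuRequest{"nvidia", true, ""}, false},
	{"Unmanaged", gpuRequest{"nvidia", true, "Unmanaged"}, false},
	{"Managed on NVIDIA SKU", gpuRequest{"nvidia", true, "Managed"}, true},
	{"Managed on AMD SKU", gpuRequest{"amd", true, "Managed"}, false},
	{"Managed with mode: None", gpuRequest{"nvidia", false, "Managed"}, false},
}

func main() {
	for _, c := range cases {
		got := configureGPUProfileSketch(c.req).Nvidia != nil
		fmt.Printf("%-26s wantNvidia=%-5t gotNvidia=%t\n", c.name, c.wantNvidia, got)
	}
}
```

The last two rows are the defensive guards: the CEL rule already rejects Managed with mode: None at admission, so the builder check only matters if an invalid object reaches it anyway.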

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

Add an opt-in managed NVIDIA GPU experience to AKSNodeClass via gpu.nvidia.managementMode (Managed|Unmanaged). When Managed, AKS manages the NVIDIA device plugin and DCGM metrics on top of the GPU driver. Defaults to Unmanaged; existing GPU NodeClasses are unaffected.

Add gpu.nvidia.managementMode (Managed|Unmanaged) to the v1beta1
AKSNodeClass and forward it to the AKS machine API GPU profile, allowing
GPU NodePools to opt into the managed NVIDIA GPU experience (AKS installs
and manages the NVIDIA device plugin and DCGM metrics on top of the GPU
driver) instead of managing those components themselves.

The field is optional and defaults to Unmanaged, so existing GPU
NodeClasses are unaffected (no hash/version change). Managed requires
driver installation (gpu.mode != None) and applies only to NVIDIA GPU
SKUs; both are enforced via a CEL rule and a defensive guard in the
machine builder.