feat: support Windows node provisioning via the AKS Machine API #1726
Open
MarcPow wants to merge 1 commit into
8b8122b to
d4e6ffb
Compare
theunrepentantgeek
requested changes
Jun 19, 2026
theunrepentantgeek
left a comment
Member
First pass review, needs another look from someone more familiar with the codebase.
Both Windows2019 and WindowsAnnual are on their way out (going or already gone), so we can simplify this code by omitting those.
Plus, I have a question about the uniqueness of Windows NetBIOS names - I suspect there's not sufficient entropy to make naming collisions unlikely.
4f79f52 to
0bb9fff
Compare
Author
@theunrepentantgeek thanks for the thorough review — all of your comments are addressed and the conversations are resolved. Summary:
Follow-up beyond the review (worth a look):
Re-requesting review when you have a moment. 🙏
Enable Karpenter (Azure provider) to provision Windows worker nodes when
running in the AKS Machine API provision mode (PROVISION_MODE=aksmachineapi /
aksmachineapiheaderbatch). The Karpenter controller itself continues to run as
a Linux pod; only the provisioned worker nodes are Windows. This is the mode
AKS ships as managed Node Auto Provisioning (NAP), and the AKS RP / Node
Provisioning Service already handles Windows server-side (OSType/OSSKU,
AgentPoolWindowsProfile, and admin-credential sourcing from the cluster's
windowsProfile), so no RP-side changes are required.
Previously every code path was hard-coded to Linux: instance types advertised
kubernetes.io/os=linux unconditionally, there was no Windows image family, and
the Machine builder always emitted OSType=Linux. As a result the scheduler
never matched a Windows pod to any offering and no Windows node was ever
created.
What this change does
---------------------
Scheduling
- Derive the kubernetes.io/os requirement from the NodeClass image family via
a new v1beta1.GetOSForImageFamily helper, instead of hard-coding Linux, so a
Windows NodeClass advertises os=windows and the scheduler bin-packs Windows
pods onto provisionable nodes. OS values are defined as constants
(v1beta1.OSLinux / OSWindows) rather than magic strings.
NodeClass API (v1beta1)
- Add Windows image families: Windows2022 and Windows2025, each mapped 1:1 to
its os-sku label value (kubernetes.azure.com/os-sku), plus a WindowsFamilies
set and a case-insensitive v1beta1.IsWindowsImageFamily helper used by every
family check so casing never silently routes a Windows family down the Linux
path.
- Extend the imageFamily CEL enum and register the Windows os-sku values in
WellKnownValuesForRequirements.
- Reject FIPS and linuxOSConfig for Windows families via CEL validation; the
AKSNodeClass CRD (and the Helm CRD copy) are regenerated accordingly.
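The OS derivation and the case-insensitive family check described above could look roughly like the following. This is a minimal sketch, not the code in this change: the helper and constant names follow the commit message, but the lower-cased lookup map and the "anything else is Linux" fallback are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// OS label values; the names mirror the constants mentioned in the commit message.
const (
	OSLinux   = "linux"
	OSWindows = "windows"
)

// windowsFamilies is a hypothetical stand-in for the WindowsFamilies set,
// keyed by lower-cased family name so lookups can ignore case.
var windowsFamilies = map[string]struct{}{
	"windows2022": {},
	"windows2025": {},
}

// IsWindowsImageFamily reports whether family names a Windows image family,
// ignoring case.
func IsWindowsImageFamily(family string) bool {
	_, ok := windowsFamilies[strings.ToLower(family)]
	return ok
}

// GetOSForImageFamily returns the kubernetes.io/os value for an image family.
// Assumption: every family that is not a Windows family is treated as Linux.
func GetOSForImageFamily(family string) string {
	if IsWindowsImageFamily(family) {
		return OSWindows
	}
	return OSLinux
}

func main() {
	for _, f := range []string{"Windows2022", "windows2025", "Ubuntu", ""} {
		fmt.Printf("%q -> %s\n", f, GetOSForImageFamily(f))
	}
}
```

Routing every family check through one helper means a mis-cased value such as "WINDOWS2022" still resolves to os=windows instead of falling through to the Linux path.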
Image selection
- Add a Windows image family (pkg/providers/imagefamily/windows.go) that
returns amd64 Windows images from the AKS-managed shared image gallery
(AKSWindows), preferring Gen2 over Gen1. Windows is SIG-only (no community gallery)
and has no FIPS variants. The bootstrap methods return a clear error because
Windows is only supported via the Machine API path, not the in-provider
bootstrappers.
- Include the AKSWindows gallery in FilteredNodeImages; it was previously
filtered to the Ubuntu and Azure Linux galleries only, which silently
dropped every Windows image.
- Convert Windows node image versions correctly: the AKSWindows image
definition prefix ("windows-") is stripped so the result matches the AKS
node-image-version form (e.g. AKSWindows-2022-containerd-gen2-<version>).
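The node-image-version conversion above can be sketched as follows. The function name and signature are hypothetical, and the image definition name "windows-2022-containerd-gen2" is inferred from the example in the commit message rather than taken from the code.

```go
package main

import (
	"fmt"
	"strings"
)

// windowsDefinitionPrefix is the image-definition prefix that the commit
// message says is stripped for AKSWindows images.
const windowsDefinitionPrefix = "windows-"

// nodeImageVersion is a hypothetical helper that builds
// "<gallery>-<definition>-<version>", dropping the "windows-" prefix from the
// image definition when it is present. Definitions without the prefix pass
// through unchanged.
func nodeImageVersion(gallery, definition, version string) string {
	definition = strings.TrimPrefix(definition, windowsDefinitionPrefix)
	return fmt.Sprintf("%s-%s-%s", gallery, definition, version)
}

func main() {
	// Prints: AKSWindows-2022-containerd-gen2-<version>
	fmt.Println(nodeImageVersion("AKSWindows", "windows-2022-containerd-gen2", "<version>"))
}
```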
AKS Machine API provisioning
- Set OSType=Windows and OSSKU=Windows2022/Windows2025 for Windows NodeClasses,
omit the LinuxProfile, and force EnableFIPS=false.
- Omit NodeImageVersion for Windows machines: the RP's input parser splits on
"-" and expects exactly gallery-name-version, but Windows image names contain
hyphens, so an explicit value is rejected. Leaving it empty lets the RP
resolve the latest image from the OSSKU.
- Generate a short, deterministic, hyphen-free AKS machine name for Windows to
satisfy the Windows NetBIOS computer-name limit enforced by the Machine API.
The budget is pool-aware, mirroring the AKS RP machine-name validation: <=12
chars in the reserved NAP pool (aksmanagedap) and <=5 chars in a custom /
self-hosted machines pool (whose RP-composed VM name also embeds the pool
name). The name is split into dedicated GetLinuxAKSMachineName and
GetWindowsAKSMachineName(maxLen) helpers chosen at the call site; Linux
machine-name generation is unchanged.
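One way to obtain a short, deterministic, hyphen-free name within a length budget is to hash a stable seed and truncate the encoded digest. The sketch below illustrates that idea only; it is not the generator in this change. The function name, the choice of SHA-256, the letters-only encoding, and the use of the NodeClaim name as the seed are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Length budgets quoted in the commit message.
const (
	maxLenReservedPool = 12 // reserved NAP pool (aksmanagedap)
	maxLenCustomPool   = 5  // custom / self-hosted machines pool
)

// windowsMachineName is a hypothetical generator: it hashes the seed with
// SHA-256, maps each 4-bit nibble to a letter in 'a'..'p', and truncates the
// result to maxLen characters. The same seed always yields the same name, and
// the output never contains a hyphen.
func windowsMachineName(seed string, maxLen int) string {
	sum := sha256.Sum256([]byte(seed))
	out := make([]byte, 0, 2*len(sum))
	for _, b := range sum {
		out = append(out, 'a'+(b>>4), 'a'+(b&0x0f))
	}
	if maxLen < 0 {
		maxLen = 0
	}
	if maxLen > len(out) {
		maxLen = len(out)
	}
	return string(out[:maxLen])
}

func main() {
	seed := "default-abc12" // hypothetical NodeClaim name
	fmt.Println(windowsMachineName(seed, maxLenReservedPool))
	fmt.Println(windowsMachineName(seed, maxLenCustomPool))
}
```

With this particular encoding each character carries 4 bits, so a 5-character name has only 20 bits of entropy and collisions become likely once the number of names reaches the order of a thousand. That is the kind of trade-off the review question about NetBIOS name uniqueness points at; a denser alphabet would raise the entropy per character.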
Hyper-V generation (Windows)
- Request a Generation 2 Windows image from the RP, when the selected SKU
supports it, via the UseWindowsGen2VM header. In Machine API mode the RP
resolves the Windows image generation server-side and, for the
Windows2022/Windows2019 OSSKUs, defaults to Generation 1 unless this header is
set (it then rejects the create if the SKU does not support the requested
generation). Karpenter selects the cheapest compatible SKU, which is
frequently Gen2-only, so it sets the header exactly when the chosen SKU
advertises Gen2 (karpenter.azure.com/sku-hyperv-generation): Gen2-only and
dual-generation SKUs get Gen2 (preferred), Gen1-only SKUs fall back to the
RP's Gen1 default. This mirrors the gen2-then-gen1 preference of the
in-provider Windows image family and lets Windows NodePools provision on any
SKU generation. The header is threaded through both the standard and the
header-batch create paths (and participates in the batch key).
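The header decision described above reduces to a small predicate. The following is a sketch under stated assumptions: the function name is hypothetical, and the SKU's supported Hyper-V generations are modelled as a slice of "1"/"2" strings, which may not match how the karpenter.azure.com/sku-hyperv-generation label is actually read in the code.

```go
package main

import "fmt"

// shouldUseWindowsGen2VM is a hypothetical predicate for setting the
// UseWindowsGen2VM header: true only for Windows machines whose chosen SKU
// advertises Generation 2. Gen1-only SKUs return false and so fall back to
// the RP's Gen1 default.
func shouldUseWindowsGen2VM(isWindows bool, skuGenerations []string) bool {
	if !isWindows {
		return false
	}
	for _, g := range skuGenerations {
		if g == "2" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("Gen2-only SKU:", shouldUseWindowsGen2VM(true, []string{"2"}))
	fmt.Println("dual-generation SKU:", shouldUseWindowsGen2VM(true, []string{"1", "2"}))
	fmt.Println("Gen1-only SKU:", shouldUseWindowsGen2VM(true, []string{"1"}))
	fmt.Println("Linux machine:", shouldUseWindowsGen2VM(false, []string{"1", "2"}))
}
```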
Hybrid clusters
- Always pin the Karpenter controller Deployment to Linux nodes
(kubernetes.io/os=linux), merged with any user-provided nodeSelector and
taking precedence over it. Now that Karpenter can provision Windows nodes,
this guarantees the Linux-only controller is never scheduled onto a Windows
node in a mixed-OS cluster.
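The merge-with-precedence rule above is shown here in Go purely for illustration; the change may well implement it elsewhere (for example in the chart), and the function name is hypothetical.

```go
package main

import "fmt"

// mergeNodeSelector is a hypothetical helper: it copies the user-provided
// nodeSelector and then sets kubernetes.io/os=linux last, so the pin takes
// precedence over any conflicting user value. The input map is not modified.
func mergeNodeSelector(user map[string]string) map[string]string {
	merged := make(map[string]string, len(user)+1)
	for k, v := range user {
		merged[k] = v
	}
	merged["kubernetes.io/os"] = "linux"
	return merged
}

func main() {
	user := map[string]string{
		"kubernetes.io/os": "windows", // conflicting user value, overridden below
		"pool":             "system",  // unrelated user key, preserved
	}
	fmt.Println(mergeNodeSelector(user))
}
```

Writing the pinned key after copying the user map is what gives it precedence; reversing the order would let a user selector move the Linux-only controller onto a Windows node.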
Tests and tooling
- Unit tests for the OS/os-sku mapping, case-insensitive family matching, the
Windows image family and GetImageFamily wiring, Windows node-image-version
conversion, OSSKU/OSType selection in the Machine builder, the AKSWindows
gallery filter, the pool-aware Windows machine-name generator, and the
Gen2-image (UseWindowsGen2VM) decision and its batch-key/header plumbing.
- New e2e suite (test/suites/windows) that provisions a Windows node via the
Machine API and runs a Windows pod, asserting the node's os and os-sku
labels. It runs in Machine API mode against either the reserved NAP-managed
agent pool or a custom machines pool whose name is <=6 chars (the pool-aware
Windows machine-name budget). The Windows NodePool's SKU is intentionally left
unconstrained: because Karpenter requests a Gen2 image whenever the selected
SKU supports it, Windows provisions on any Hyper-V generation, including
Gen2-only sizes. Adds WindowsNodeClass/WindowsNodePool helpers, an
az-mkaks-windows cluster target with a windowsProfile, an AKSWindows SIG
reader role assignment, and makes the self-hosted machines-pool name
configurable.
Scope and notes
- Windows is amd64-only by design.
- Windows support targets the AKS Machine API path only; the in-provider
bootstrapping paths (aksscriptless/bootstrappingclient) continue to reject
Windows.
- GPU-on-Windows and Windows-specific max-pods defaults are intentionally left
for follow-up.
CI
- Wire the Windows e2e suite into the E2E matrix so it runs daily and on push.
Windows is special-cased in workflows/e2e.yaml: it always runs in AKS Machine
API mode (it is only provisionable that way and otherwise skips) on a dedicated
cluster (new ci-mkcluster-all-windows target -> az-mkaks-windows, Azure CNI
overlay + windowsProfile) because Windows does not support the Cilium dataplane
used by the default CI cluster, with a machines pool name <= 6 chars (winmp) for
the Windows machine-name limit. The ephemeral cluster's Windows admin password
is generated and masked in the create-cluster action (no new repository secret).
Non-Windows suites are unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
0bb9fff to
00ae112
Compare
Description
Enable Karpenter (Azure provider) to provision Windows worker nodes when running in the AKS Machine API provision mode (PROVISION_MODE=aksmachineapi / aksmachineapiheaderbatch). The Karpenter controller continues to run as a Linux pod; only the provisioned worker nodes are Windows. This is the mode AKS ships as managed Node Auto Provisioning (NAP), and the AKS RP / Node Provisioning Service already handles Windows server-side (OSType/OSSKU, AgentPoolWindowsProfile, and admin-credential sourcing from the cluster windowsProfile), so no RP-side changes are required.

Previously every code path was hard-coded to Linux: instance types advertised kubernetes.io/os=linux unconditionally, there was no Windows image family, and the Machine builder always emitted OSType=Linux. As a result the scheduler never matched a Windows pod to any offering and no Windows node was ever created.

What this PR does

Scheduling
- Derive the kubernetes.io/os requirement from the NodeClass image family (new v1beta1.GetOSForImageFamily) instead of hard-coding Linux, so a Windows NodeClass advertises os=windows and the scheduler bin-packs Windows pods.

NodeClass API (v1beta1)
- Add Windows2019, Windows2022, Windows2025, WindowsAnnual image families, each mapped 1:1 to its kubernetes.azure.com/os-sku value, plus a WindowsFamilies set.
- Extend the imageFamily CEL enum, register the Windows os-sku values, and reject FIPS / linuxOSConfig for Windows via CEL. CRD (and Helm CRD copy) regenerated.

Image selection
- Add a Windows image family from the AKS-managed shared image gallery (AKSWindows), gen2-then-gen1, amd64; SIG-only, no FIPS.
- Include the AKSWindows gallery in FilteredNodeImages (previously filtered to Ubuntu/Azure Linux only, which silently dropped all Windows images).
- Strip the windows- image-definition prefix so the node-image-version matches the AKS form (e.g. AKSWindows-2022-containerd-gen2-<version>).

AKS Machine API provisioning
- Set OSType=Windows and the Windows OSSKU, omit LinuxProfile, force EnableFIPS=false.
- Omit NodeImageVersion for Windows: the RP input parser splits on "-" and expects gallery-name-version, but Windows names contain hyphens, so an explicit value is rejected; leaving it empty lets the RP resolve the latest image from the OSSKU.

Hybrid clusters
- Always pin the Karpenter controller Deployment to Linux nodes (kubernetes.io/os=linux), merged with and taking precedence over any user nodeSelector, so the Linux-only controller is never scheduled onto a Windows node in a mixed-OS cluster.

Testing
- Unit tests for the OS/os-sku mapping, the Windows image family and GetImageFamily wiring, Windows node-image-version conversion, OSSKU/OSType selection in the Machine builder, the AKSWindows gallery filter, and the Windows machine-name generator. Full suite green.
- New e2e suite (test/suites/windows) that provisions a Windows node via the Machine API and runs a Windows pod, asserting the node os and os-sku labels. It skips unless running in Machine API mode against the reserved NAP-managed agent pool (the only place usable Windows machine names are allowed). Adds WindowsNodeClass/WindowsNodePool helpers, an az-mkaks-windows cluster target with a windowsProfile, an AKSWindows SIG reader role assignment, and a configurable self-hosted machines-pool name.
- […] windowsProfile, aksmachineapi, USE_SIG=true). The RP accepted a correct Windows machine (osType=Windows, osSKU=Windows2022, resolved image, zoneless, short name). End-to-end pod-Ready validation requires a NAP-managed cluster, since usable Windows machine names are only allowed in the reserved aksmanagedap pool (which cannot be created in a self-hosted cluster); hence the suite's skip guard.

Scope & notes
- Windows support targets the AKS Machine API path only; the in-provider bootstrapping paths (aksscriptless/bootstrappingclient) continue to reject Windows.