feat: stateless cni compatibility with nodesubnet #3549
santhoshmprabhu wants to merge 11 commits into master
Pull Request Overview
Stateless CNI compatibility with nodesubnet has been added to allow node subnet-based pod IP configuration and non-blocking pod deletion when the HNS network is missing.
- Update endpoint deletion to return nil when no HNS id is found instead of an error
- Introduce a new IpamMode ("Nodesubnet") and adjust the IP configuration logic accordingly
- Extend unit tests to cover default add result configuration in nodesubnet mode
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| network/endpoint_windows.go | Adjusted deletion logic to skip HNS deletion errors in order to allow pod deletion when HNS id is missing. |
| cni/util/const.go | Added a new IpamMode "Nodesubnet" to support the nodesubnet configuration. |
| cni/network/invoker_cns_test.go | Added tests to validate the default add result configuration when running in nodesubnet mode. |
| cni/network/invoker_cns.go | Refactored the configureDefaultAddResult function to handle nodesubnet mode by removing the overlayMode arg and updating the gateway logic. |
Files not reviewed (1)
- hack/aks/Makefile: Language not supported
Comments suppressed due to low confidence (2)
network/endpoint_windows.go:534
- Returning nil instead of an error when HNS id is missing is intentional per the design, but please add an inline comment explaining why a nil return is acceptable as it might be unexpected.
return nil
cni/network/invoker_cns.go:411
- [nitpick] The error message for an invalid gateway address could include additional context (such as the ipam mode) to improve debuggability. Consider updating the message to clarify the expected behavior in different modes.
if podGateway == nil {
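The nitpick above (add the ipam mode to the invalid-gateway error) could look roughly like this. It is only a sketch: `parsePodGateway` and its parameters are hypothetical names for illustration, not the code in `invoker_cns.go`.

```go
package main

import (
	"fmt"
	"net"
)

// parsePodGateway is a hypothetical helper, not the function in invoker_cns.go.
// It parses the gateway address and, on failure, reports the ipam mode next to
// the offending value so the log line says which mode produced it.
func parsePodGateway(gatewayAddr, ipamMode string) (net.IP, error) {
	gw := net.ParseIP(gatewayAddr)
	if gw == nil {
		return nil, fmt.Errorf("invalid gateway address %q in ipam mode %q", gatewayAddr, ipamMode)
	}
	return gw, nil
}
```

With an empty gateway and mode "nodesubnet", the error text names both the empty address and the mode, which is the extra context the comment asks for.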
/azp run Azure Container Networking PR
Azure Pipelines successfully started running 1 pipeline(s).
if ep.HnsId == "" {
    logger.Error("No HNS id found. Skip endpoint deletion", zap.Any("nicType", ep.NICType), zap.String("containerId", ep.ContainerID))
-   return fmt.Errorf("No HNS id found. Skip endpoint deletion for nicType %v, containerID %s", ep.NICType, ep.ContainerID) //nolint
+   return nil
I think adding a comment about why we are returning nil here would be helpful
maybe it could return an error of some known "NotFound" type, and the caller could tolerate that, instead
It seems we want to allow this behavior, at least according to this UT:
@paulyufan mentioned that this UT is run manually locally, not as part of the PR pipeline. The UT fails on master, and needs this change to pass actually. Possibly deleteEndpointImpl drifted from intended behavior by mistake?
hm, why isn't this run in the PR pipeline? because it's in a _windows test?
Not sure about that actually. We do have a Windows Tests stage in the PR pipeline of course. Maybe @paulyufan could weigh in?
Hi @rbtr, TestDeleteEndpointImplHnsV2WithEmptyHNSID() is part of the _windows UT cases, and the PR pipeline does not currently run them. John confirmed that the ACN pipeline runs only the npm Windows UTs, not the others.
Below are the UT(s) we run for Windows in the pipeline.
@rbtr, we can expand the scope for our Windows test to include this (and other) test(s), but given this info, what are your thoughts on the diff above?
unchanged from my first comment
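For reference, the alternative raised in this thread (return a known "NotFound"-type error and let the caller tolerate it) might be sketched as below. All names here are hypothetical: `ErrHNSIDNotFound`, the trimmed `endpoint` type, and both functions are stand-ins for illustration, not identifiers confirmed in the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrHNSIDNotFound is a hypothetical sentinel error used only in this sketch.
var ErrHNSIDNotFound = errors.New("no HNS id found")

// endpoint is a stripped-down stand-in for the real endpoint type.
type endpoint struct {
	HnsId       string
	ContainerID string
}

// deleteEndpoint reports a missing HNS id as a wrapped sentinel instead of nil,
// so the condition stays visible to callers.
func deleteEndpoint(ep *endpoint) error {
	if ep.HnsId == "" {
		return fmt.Errorf("skip endpoint deletion for containerID %s: %w", ep.ContainerID, ErrHNSIDNotFound)
	}
	// The real HNS delete call would go here.
	return nil
}

// deletePodEndpoint is the caller: it tolerates only the not-found case and
// passes every other error through, so pod deletion is not blocked.
func deletePodEndpoint(ep *endpoint) error {
	err := deleteEndpoint(ep)
	if errors.Is(err, ErrHNSIDNotFound) {
		return nil
	}
	return err
}
```

The trade-off versus `return nil` in the diff: the low-level function keeps reporting the anomaly, and the decision to ignore it moves to the one caller that needs non-blocking deletion.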
// TODO: Remove v4overlay and dualstackoverlay options, after 'overlay' rolls out in AKS-RP
if !overlayMode {
podGateway := net.ParseIP(info.ncGatewayIPAddress)
// TODO: Remove v4overlay and dualstackoverlay options, after 'overlay' rolls out in AKS-RP
yeah, let's remove the selection among these ipam modes later; I believe overlay has rolled out in aks-rp. @pjohnst5, can you confirm this?
This isn't yet rolled out because in CNI 1.4 we don't have the unified overlay parameter. We could possibly update the CNS config map in a way that preserved v4overlay only in 1.4 and unifies the option in 1.5 and 1.6, but that has not been done likely to avoid complicating the CNS config logic. There's a separate ongoing conversation about how this needs to be handled in AgentBaker for Windows nodes.
Had a chat with @tamilmani1989. He suggested keeping CNI behavior independent of IPAM modes, which means moving the handling of the Nodesubnet case into CNS. Closing this PR; I will open a separate one for CNS.

Reason for Change:
Stateless CNI is currently not compatible with nodesubnet, even with the changes that were recently incorporated into CNS for Cilium+Nodesubnet support. Specifically, IP configs returned from CNS in Nodesubnet mode do not populate the NC-specific fields (ncGatewayIPAddress, ncSubnetPrefix, ncPrimaryIP, etc.). The CNI today expects these values, since CNS is invoked only for the Podsubnet/Overlay cases. With this PR, we have the following changes:
The PR touches code that appears to have TODO items to clean up the Overlay configuration parameter. I'm planning to do that separately once I validate that the new parameter is fully rolled out.
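To illustrate the incompatibility described above, here is a minimal sketch of gateway handling that accepts an empty NC gateway only in nodesubnet mode. It is an assumption-laden illustration, not the PR's implementation: the `ipamMode` type, both constant values, the trimmed `ipResultInfo` struct, and `gatewayFor` are all hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// ipamMode and its values are hypothetical; the real constants live in
// cni/util/const.go and may be spelled differently.
type ipamMode string

const (
	modeOverlay    ipamMode = "overlay"
	modeNodesubnet ipamMode = "nodesubnet"
)

// ipResultInfo keeps only the one NC field this sketch needs.
type ipResultInfo struct {
	ncGatewayIPAddress string
}

// gatewayFor returns the gateway to program and whether one is present.
// In nodesubnet mode an empty NC gateway is expected and is not an error;
// in every other mode a missing or malformed gateway is rejected.
func gatewayFor(info ipResultInfo, mode ipamMode) (net.IP, bool, error) {
	if mode == modeNodesubnet && info.ncGatewayIPAddress == "" {
		// CNS leaves the NC fields empty in this mode, so there is nothing to parse.
		return nil, false, nil
	}
	gw := net.ParseIP(info.ncGatewayIPAddress)
	if gw == nil {
		return nil, false, fmt.Errorf("invalid gateway %q for ipam mode %q", info.ncGatewayIPAddress, mode)
	}
	return gw, true, nil
}
```

A caller would program the default route only when the boolean is true, which keeps the existing Podsubnet/Overlay validation intact.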