Skip to content

feat: stateless cni compatibility with nodesubnet#3549

Closed
santhoshmprabhu wants to merge 11 commits intomasterfrom
sanprabhu/stateless-cni-nodesubnet
Closed

feat: stateless cni compatibility with nodesubnet#3549
santhoshmprabhu wants to merge 11 commits intomasterfrom
sanprabhu/stateless-cni-nodesubnet

Conversation

@santhoshmprabhu
Copy link
Copy Markdown
Contributor

@santhoshmprabhu santhoshmprabhu commented Apr 1, 2025

Reason for Change:
Stateless CNI is currently not compatible with nodesubnet, even with the changes that were recently incorporated into CNS for Cilium+Nodesubnet support. Specifically, IP configs returned from CNS in Nodesubnet mode do not populate NC-specific fields - ncGatewayIPAddress, ncSubnetPrefix, ncPrimarIP etc. The CNI today expects these values, since CNS is invoked only for Podsubnet/Overlay cases. With this PR, we have the following changes:

  1. Have a configuration option to configure the CNI for nodesubnet
  2. In nodesubnet mode, the CNI uses hostGateway and subnet to configure the pods
  3. The conflist for Azure CNI Nodesubnet on both windows and linux additionally specifies the mode as "nodesubnet". Today, the mode is specified only for overlay. Nodesubnet uses azure-vnet-ipam as the IPAM type, and podsubnet uses azure-cns as IPAM type with mode unspecified. With this change, cniv1 clusters can continue using azure-vnet-ipam (mode is ignored), whenever we deploy stateless CNI with nodesubnet, we'll use azure-cns as the type, and nodesubnet as the mode. Making this change here allows us to leave AgentBaker CSE scripts untouched.
  4. Doesn't block pod deletion if HNS network is not found. Rather, we log the event and proceed with pod deletion. Effectively, this allows deleting pods that never got an IP. This is an enhancement to help debug issues more easily.

The PR touches code that appears to have TODO items to clean up the Overlay configuration parameter. I'm planning to do that separately once I validate that the new parameter is fully rolled out.

Issue Fixed:

Requirements:

Notes:

Copilot AI review requested due to automatic review settings April 1, 2025 21:24
@santhoshmprabhu santhoshmprabhu requested review from a team as code owners April 1, 2025 21:24
@santhoshmprabhu santhoshmprabhu requested a review from vipul-21 April 1, 2025 21:24
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Stateless CNI compatibility with nodesubnet has been added to allow node subnet-based pod IP configuration and non-blocking pod deletion when the HNS network is missing.

  • Update endpoint deletion to return nil when no HNS id is found instead of an error
  • Introduce a new IpamMode ("Nodesubnet") and adjust the IP configuration logic accordingly
  • Extend unit tests to cover default add result configuration in nodesubnet mode

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.

File Description
network/endpoint_windows.go Adjusted deletion logic to skip HNS deletion errors in order to allow pod deletion when HNS id is missing.
cni/util/const.go Added a new IpamMode "Nodesubnet" to support the nodesubnet configuration.
cni/network/invoker_cns_test.go Added tests to validate the default add result configuration when running in nodesubnet mode.
cni/network/invoker_cns.go Refactored the configureDefaultAddResult function to handle nodesubnet mode by removing the overlayMode arg and updating the gateway logic.
Files not reviewed (1)
  • hack/aks/Makefile: Language not supported
Comments suppressed due to low confidence (2)

network/endpoint_windows.go:534

  • Returning nil instead of an error when HNS id is missing is intentional per the design, but please add an inline comment explaining why a nil return is acceptable as it might be unexpected.
return nil

cni/network/invoker_cns.go:411

  • [nitpick] The error message for an invalid gateway address could include additional context (such as the ipam mode) to improve debuggability. Consider updating the message to clarify the expected behavior in different modes.
if podGateway == nil {

@santhoshmprabhu santhoshmprabhu self-assigned this Apr 1, 2025
@santhoshmprabhu santhoshmprabhu added cni Related to CNI. windows labels Apr 1, 2025
@santhoshmprabhu
Copy link
Copy Markdown
Contributor Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@santhoshmprabhu
Copy link
Copy Markdown
Contributor Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@santhoshmprabhu
Copy link
Copy Markdown
Contributor Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

behzad-mir
behzad-mir previously approved these changes Apr 2, 2025
Copy link
Copy Markdown
Contributor

@behzad-mir behzad-mir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment thread network/endpoint_windows.go Outdated
if ep.HnsId == "" {
logger.Error("No HNS id found. Skip endpoint deletion", zap.Any("nicType", ep.NICType), zap.String("containerId", ep.ContainerID))
return fmt.Errorf("No HNS id found. Skip endpoint deletion for nicType %v, containerID %s", ep.NICType, ep.ContainerID) //nolint
return nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a comment about why we are returning nil here would be helpful

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it could return an error of some known "NotFound" type, and the caller could tolerate that, instead

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we want to allow this behavior, at least according to this UT:

func TestDeleteEndpointImplHnsV2WithEmptyHNSID(t *testing.T) {

@paulyufan mentioned that this UT is run manually locally, not as part of the PR pipeline. The UT fails on master, and needs this change to pass actually. Possibly deleteEndpointImpl drifted from intended behavior by mistake?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm why this isn't run in the PR pipeline? because it's in a _windows test?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about that actually. We do have a Windows Tests stage in the PR pipeline of course. Maybe @paulyufan could weigh in?

Copy link
Copy Markdown
Contributor

@paulyufan2 paulyufan2 Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rbtr , TestDeleteEndpointImplHnsV2WithEmptyHNSID() is part of _windows UT test cases and PR pipeline does not run them currently. John confirmed we only run npm windows UTs on ACN pipeline but not for others

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below contains UT(s) we run for windows in the pipeline.

go test -timeout 30m ./npm/... ./cni/... ./platform/...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbtr, we can expand the scope for our Windows test to include this (and other) test(s), but given this info, what are your thoughts on the diff above?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unchanged from my first comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbtr addressed, please take a look.

Comment thread cni/network/invoker_cns.go Outdated
@santhoshmprabhu
Copy link
Copy Markdown
Contributor Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@santhoshmprabhu
Copy link
Copy Markdown
Contributor Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

// TODO: Remove v4overlay and dualstackoverlay options, after 'overlay' rolls out in AKS-RP
if !overlayMode {
podGateway := net.ParseIP(info.ncGatewayIPAddress)
// TODO: Remove v4overlay and dualstackoverlay options, after 'overlay' rolls out in AKS-RP
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, let's remove these ipam modes selection later, believe overlay has rollout in aks-rp. @pjohnst5 can you confirm this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Brand new cluster - it seems this isn't rolled out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't yet rolled out because in CNI 1.4 we don't have the unified overlay parameter. We could possibly update the CNS config map in a way that preserved v4overlay only in 1.4 and unifies the option in 1.5 and 1.6, but that has not been done likely to avoid complicating the CNS config logic. There's a separate ongoing conversation about how this needs to be handled in AgentBaker for Windows nodes.

@santhoshmprabhu
Copy link
Copy Markdown
Contributor Author

Had a chat with @tamilmani1989. He suggested to keep CNI behavior independent of IPAM modes, which means moving the handling of Nodesubnet case into CNS. Closing this PR, will open a separate one for CNS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cni Related to CNI. windows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants