@tssurya
Contributor

@tssurya tssurya commented Oct 19, 2025

See the commit message for details.

Tested on AWS. Steps: https://issues.redhat.com/browse/OCPBUGS-60806?focusedId=28194587&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28194587

Before change:
cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-00d18740718a0e5d3","ifaddr":{"ipv4":"10.0.0.0/19"},"capacity":{"ipv6":15}}]'

After change:
cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-00d18740718a0e5d3","ifaddr":{"ipv4":"10.0.0.0/19"},"capacity":{"ipv4":0,"ipv6":15}}]'

Note that the changes are also backwards compatible with OVN-Kubernetes, which uses int and not pointers.
Perhaps a followup should be to also change https://github.com/ovn-kubernetes/ovn-kubernetes/blob/f077fdd127d82bce44a5404a78d4dd88fcf930e5/go-controller/pkg/clustermanager/egressip_controller.go#L1316 into pointers. One more thing to consider is how to avoid unlimited capacity on the OVN-Kubernetes side for cloud, since it doesn't make much sense there, unlike on bare metal. But that is a change in behaviour, so it can be a separate fix.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 19, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 19, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 19, 2025
Today, in CNCC we store the capacity values as
integers:

type capacity struct {
  IPv4 int `json:"ipv4,omitempty"`
  IPv6 int `json:"ipv6,omitempty"`
  IP   int `json:"ip,omitempty"`
}

When capacity is full, CNCC sets the value to 0.
Depending on the platform, it also leaves unset the
fields it doesn't care about (for example, AWS doesn't
use IP; GCP and Azure don't use IPv4 and IPv6).

However, because omitempty is set, the zero value
was being omitted from the annotation. When OVN-Kubernetes
reads this annotation, it then leaves the capacity
at its unlimited default:

nodeEgressIPConfig := []nodeEgressIPConfiguration{
        {
            Capacity: Capacity{
                IP:   UnlimitedNodeCapacity,
                IPv4: UnlimitedNodeCapacity, // we set this to maxint32
                IPv6: UnlimitedNodeCapacity,
            },
        },
    }

which causes all EgressIPs to be
assigned to this node, leading to:

status:
  conditions:
  - lastTransitionTime: "2025-10-06T19:24:24Z"
    message: "Error processing cloud assignment request, err: PrivateIpAddressLimitExceeded:
      Number of private addresses will exceed limit.\n\tstatus code: 400, request
      id: 457f4332-e9c4-44c9-bfcf-deeb5e7e43ce"
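
The omission described above can be reproduced in isolation. The following is a minimal sketch, assuming a standalone copy of the int-based struct; the names intCapacity and marshalIntCapacity are hypothetical and exist only for this illustration, not in CNCC.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// intCapacity mirrors the int-based capacity struct quoted in the commit message.
type intCapacity struct {
	IPv4 int `json:"ipv4,omitempty"`
	IPv6 int `json:"ipv6,omitempty"`
	IP   int `json:"ip,omitempty"`
}

// marshalIntCapacity is a hypothetical helper that returns the JSON encoding.
func marshalIntCapacity(c intCapacity) string {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// IPv4 is exhausted (0), IPv6 has 15 addresses left, IP is unused on AWS.
	// With omitempty, every zero-valued int is dropped, so the exhausted IPv4
	// capacity is indistinguishable from a field that was never set.
	fmt.Println(marshalIntCapacity(intCapacity{IPv4: 0, IPv6: 15}))
	// prints {"ipv6":15}
}
```

The output matches the capacity portion of the "Before change" annotation in the PR description.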

In this fix, what we really want is to remove omitempty
so that zero capacity is reflected correctly. However,
doing so means that unset fields would also appear as zero,
which can lead to confusion: we would not be able to
distinguish between an unset field and a field whose value is 0.

Hence we are changing the capacity struct to use pointer-typed
values so that null/nil means unset and 0 means the capacity is
exhausted. We still keep omitempty since we don't need to do
anything with unset fields: there is no behaviour change there,
and OVN-Kubernetes will continue to treat them as unlimited
capacity.
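
As a rough sketch of the pointer-based approach, assuming the field names and JSON tags stay the same as in the int-based struct: the type ptrCapacity and the helpers intPtr and marshalPtrCapacity are hypothetical names for this illustration and may differ from what the PR actually uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ptrCapacity is a sketch of the capacity struct with pointer-typed fields.
type ptrCapacity struct {
	IPv4 *int `json:"ipv4,omitempty"`
	IPv6 *int `json:"ipv6,omitempty"`
	IP   *int `json:"ip,omitempty"`
}

// intPtr is a hypothetical helper that returns a pointer to v.
func intPtr(v int) *int { return &v }

// marshalPtrCapacity is a hypothetical helper that returns the JSON encoding.
func marshalPtrCapacity(c ptrCapacity) string {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// For pointers, omitempty only drops nil. A pointer to 0 is kept, so an
	// exhausted capacity is emitted while the unset IP field is still omitted.
	fmt.Println(marshalPtrCapacity(ptrCapacity{IPv4: intPtr(0), IPv6: intPtr(15)}))
	// prints {"ipv4":0,"ipv6":15}
}
```

The output matches the capacity portion of the "After change" annotation in the PR description.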

Upgrades: CNCC upon reboot seems to call:
func (n *NodeController) SyncHandler(key string) error {
....
	// Filter out cloudPrivateIPConfigs assigned to node (key) and write the entry
	// into same slice starting from index 0, finally chop off unwanted entries
	// when passing it into GetNodeEgressIPConfiguration.
	index := 0
	for _, cloudPrivateIPConfig := range cloudPrivateIPConfigs {
		if isAssignedCloudPrivateIPConfigOnNode(cloudPrivateIPConfig, key) {
			cloudPrivateIPConfigs[index] = cloudPrivateIPConfig
			index++
		}
	}
	nodeEgressIPConfigs, err := n.cloudProviderClient.GetNodeEgressIPConfiguration(node, cloudPrivateIPConfigs[:index])
	if err != nil {
		return fmt.Errorf("error retrieving the private IP configuration for node: %s, err: %v", node.Name, err)
	}
	return n.SetNodeEgressIPConfigAnnotation(node, nodeEgressIPConfigs)
}

// SetCloudPrivateIPConfigAnnotationOnNode annotates the corev1.Node with the cloud subnet information and capacity
func (n *NodeController) SetNodeEgressIPConfigAnnotation(node *corev1.Node, nodeEgressIPConfigs []*cloudprovider.NodeEgressIPConfiguration) error {
	annotation, err := n.generateAnnotation(nodeEgressIPConfigs)
	if err != nil {
		return err
	}
	klog.Infof("Setting annotation: '%s: %s' on node: %s", nodeEgressIPConfigAnnotationKey, annotation, node.Name)
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		ctx, cancel := context.WithTimeout(n.ctx, controller.ClientTimeout)
		defer cancel()

		// See: updateCloudPrivateIPConfigStatus
		nodeLatest, err := n.kubeClient.CoreV1().Nodes().Get(ctx, node.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		existingAnnotations := nodeLatest.Annotations
		existingAnnotations[nodeEgressIPConfigAnnotationKey] = annotation
		nodeLatest.SetAnnotations(existingAnnotations)
		_, err = n.kubeClient.CoreV1().Nodes().Update(ctx, nodeLatest, metav1.UpdateOptions{})
		return err
	})
}

and we seem to be overwriting the annotation, so we should be good on upgrades
when moving from the older annotations to the new ones, where 0-valued fields
will appear for full-capacity nodes.

Once that happens, OVN-Kubernetes should overwrite the unlimited default with the
value 0, which indicates zero capacity, and we should enter:

			if eNode.egressIPConfig.Capacity.IP < util.UnlimitedNodeCapacity {
				if eNode.egressIPConfig.Capacity.IP-len(eNode.allocations) <= 0 {
					klog.V(5).Infof("Additional allocation on Node: %s exhausts it's IP capacity, trying another node", eNode.name)
					continue
				}
			}
			if eNode.egressIPConfig.Capacity.IPv4 < util.UnlimitedNodeCapacity && utilnet.IsIPv4(eIP) {
				if eNode.egressIPConfig.Capacity.IPv4-getIPFamilyAllocationCount(eNode.allocations, false) <= 0 {
					klog.V(5).Infof("Additional allocation on Node: %s exhausts it's IPv4 capacity, trying another node", eNode.name)
					continue
				}
			}
			if eNode.egressIPConfig.Capacity.IPv6 < util.UnlimitedNodeCapacity && utilnet.IsIPv6(eIP) {
				if eNode.egressIPConfig.Capacity.IPv6-getIPFamilyAllocationCount(eNode.allocations, true) <= 0 {
					klog.V(5).Infof("Additional allocation on Node: %s exhausts it's IPv6 capacity, trying another node", eNode.name)
					continue
				}
			}

these desired conditions correctly.
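
The reader side can be sketched as follows. This is an illustration under stated assumptions, not the OVN-Kubernetes parsing code: readerCapacity, readerConfig, parseFirstConfig and hasIPv4Capacity are hypothetical names, and unlimitedNodeCapacity stands in for util.UnlimitedNodeCapacity, taken to be maxint32 as noted above. It relies on json.Unmarshal leaving struct fields untouched when their keys are absent from the input.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"math"
)

// unlimitedNodeCapacity stands in for util.UnlimitedNodeCapacity (assumed maxint32).
const unlimitedNodeCapacity = math.MaxInt32

// readerCapacity is an int-based reader struct, as on the OVN-Kubernetes side.
type readerCapacity struct {
	IPv4 int `json:"ipv4,omitempty"`
	IPv6 int `json:"ipv6,omitempty"`
	IP   int `json:"ip,omitempty"`
}

type readerConfig struct {
	Interface string         `json:"interface"`
	Capacity  readerCapacity `json:"capacity"`
}

// parseFirstConfig decodes the first entry of the annotation into a struct
// pre-filled with unlimited defaults; keys missing from the JSON keep them.
func parseFirstConfig(annotation string) (readerConfig, error) {
	var raw []json.RawMessage
	if err := json.Unmarshal([]byte(annotation), &raw); err != nil {
		return readerConfig{}, err
	}
	if len(raw) == 0 {
		return readerConfig{}, errors.New("annotation has no entries")
	}
	cfg := readerConfig{Capacity: readerCapacity{
		IP:   unlimitedNodeCapacity,
		IPv4: unlimitedNodeCapacity,
		IPv6: unlimitedNodeCapacity,
	}}
	if err := json.Unmarshal(raw[0], &cfg); err != nil {
		return readerConfig{}, err
	}
	return cfg, nil
}

// hasIPv4Capacity mirrors the IPv4 branch of the check quoted above.
func hasIPv4Capacity(c readerCapacity, allocated int) bool {
	if c.IPv4 < unlimitedNodeCapacity && c.IPv4-allocated <= 0 {
		return false
	}
	return true
}

func main() {
	before := `[{"interface":"eni-00d18740718a0e5d3","ifaddr":{"ipv4":"10.0.0.0/19"},"capacity":{"ipv6":15}}]`
	after := `[{"interface":"eni-00d18740718a0e5d3","ifaddr":{"ipv4":"10.0.0.0/19"},"capacity":{"ipv4":0,"ipv6":15}}]`
	for _, annotation := range []string{before, after} {
		cfg, err := parseFirstConfig(annotation)
		if err != nil {
			panic(err)
		}
		fmt.Println(cfg.Capacity.IPv4, hasIPv4Capacity(cfg.Capacity, 0))
	}
}
```

With the old annotation the IPv4 capacity stays at the unlimited default and the node keeps being selected; with the new annotation it is 0 and the node is skipped.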

Signed-off-by: Surya Seetharaman <[email protected]>
@tssurya tssurya force-pushed the make-capacity-field-pointer branch from 69b85e9 to 258fd57 Compare October 19, 2025 19:39
@tssurya tssurya changed the title Change the capacity struct from int to ptrOfInt OCPBUGS-60806: Change the capacity struct from int to ptrOfInt Oct 19, 2025
@tssurya tssurya marked this pull request as ready for review October 19, 2025 22:12
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 19, 2025
@openshift-ci-robot

@tssurya: This pull request references Jira Issue OCPBUGS-60806, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Oct 19, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 19, 2025
@tssurya
Contributor Author

tssurya commented Oct 19, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 19, 2025
@openshift-ci-robot

@tssurya: This pull request references Jira Issue OCPBUGS-60806, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @qiowang721


@openshift-ci openshift-ci bot requested a review from qiowang721 October 19, 2025 22:13
@tssurya
Contributor Author

tssurya commented Oct 19, 2025

@kyrtapz PTAL! I tried to add unit tests but not sure why I couldn't find any annotation related tests in CNCC? Maybe I didn't look at the right place?
But LMK what you think of the fix and hopefully I didn't miss any spots wrt changing it to pointer.

@qiowang721 @huiran0826: could one of you please verify via pre-merge testing that this fixes bug 60806? Also, let's make sure the test coverage gap is closed on all platforms (AWS, GCP, Azure, OpenStack), and let's test upgrades too: reproduce the bug on an older version, upgrade, check that the capacity shows up as 0, and then retry to confirm the bug doesn't happen again.

@openshift-ci
Contributor

openshift-ci bot commented Oct 20, 2025

@tssurya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/security | 258fd57 | link | false | /test security |
| ci/prow/okd-scos-e2e-aws-ovn | 258fd57 | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kyrtapz
Contributor

kyrtapz commented Oct 20, 2025

> note that the changes also are backwards compatible with OVN-Kubernetes which uses int and not pointers.
> Perhaps a followup should be to also change https://github.com/ovn-kubernetes/ovn-kubernetes/blob/f077fdd127d82bce44a5404a78d4dd88fcf930e5/go-controller/pkg/clustermanager/egressip_controller.go#L1316 into pointers

While I agree that this change shouldn't break ovn-kubernetes, I think we should converge the parsing sooner rather than later to avoid future issues. Do you mind creating a followup in ovn-k? It shouldn't change the behavior that this change introduces.

> one more thing to consider is how to not have unlimited capacity on ovn-kubernetes side for cloud since it doesn't make much sense there unlike baremetal. But that is a change in behaviour so that can be another fix.

The behavior is already changing with this PR: once we get it in, a cloud deployment can no longer have unlimited capacity.
For reference, GCP, Azure and OpenStack have predefined static capacity values, meaning that these were never meant to be unlimited anyway.
We read it from the interface config for AWS, but looking deeper, it is still predefined based on the VM flavor: https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html#gp_network

So, given that none of the supported CNCC providers allow unlimited IPs, I believe the defaulting in ovn-k is wrong and we should address it as a followup. The exception would be if we want to claim that components other than CNCC use this annotation; in that case we can stick to pointers that default to unlimited.

@kyrtapz
Contributor

kyrtapz commented Oct 20, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kyrtapz, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tssurya
Contributor Author

tssurya commented Oct 20, 2025

Opened ovn-kubernetes/ovn-kubernetes#5676

@qiowang721

/verified by pre-merge testing

pre-merge tested on fresh installed cluster:

  1. Create a cluster with the PR build
  2. Label a node as an egress node
  3. Assign 14 external IPs to the node via the AWS console, then restart CNCC
  4. Check the ipv4 capacity on the node; it shows "ipv4":0 as expected
  5. Create one more EIP; the EIP is not assigned due to no capacity, as expected
  6. Check cloudprivateipconfigs; there is no entry for the newly created EIP
  7. Scale up another node in the same subnet, and label it with k8s.ovn.org/egress-assignable when it is Ready
  8. Check that the newly added EIP is assigned to the newly added node, with cloudprivateipconfig showing CloudResponseSuccess

pre-merge tested for upgrade:

  1. Create a cluster with a 4.20 nightly build
  2. Label a node as an egress node, say node1
  3. Configure 14 additional IPs to fill up the capacity
  4. Restart CNCC; the "ipv4" capacity disappeared
  5. Create 15 egressIPs; got cloudprivateipconfig with CloudResponseError
  6. Upgrade the cluster to the PR build
  7. After the upgrade, check the IP capacity for node1; it displays "ipv4":0 as expected

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 23, 2025
@openshift-ci-robot

@qiowang721: This PR has been marked as verified by pre-merge testing.


@openshift-merge-bot openshift-merge-bot bot merged commit 8384756 into openshift:main Oct 23, 2025
11 of 13 checks passed
@openshift-ci-robot

@tssurya: Jira Issue Verification Checks: Jira Issue OCPBUGS-60806
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-60806 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓


@openshift-merge-robot
Contributor

Fix included in accepted release 4.21.0-0.nightly-2025-10-23-090257

@tssurya
Contributor Author

tssurya commented Oct 24, 2025

/cherry-pick release-4.20

@openshift-cherrypick-robot

@tssurya: new pull request created: #185



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria


6 participants