@arghosh93
Contributor

This PR stops adding egress IPs to the public load balancer
backend pool, regardless of the presence of an OutBoundRule, in any
Azure cluster.

The consequence of this change is no outbound connectivity
except to the infrastructure subnet, even if there is no OutBoundRule.

However, this is required to address the following problems:

- If an infra node is used as an egressNode, the health
check for the egress IP also succeeds once the IP is added to the public load
balancer, so the LB treats it as a legitimate ingress router backend.

- Adding egress IPs to the backend pool limits the number of egress IPs
that can be created on a cluster because of an Azure-specific limitation.

This PR also lets the controller remove any egress IP
previously added to the public load balancer backend pool.

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 26, 2025
@openshift-ci-robot

@arghosh93: This pull request references Jira Issue OCPBUGS-57447, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@arghosh93
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 26, 2025
@openshift-ci-robot

@arghosh93: This pull request references Jira Issue OCPBUGS-57447, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826


@openshift-ci openshift-ci bot requested a review from huiran0826 September 26, 2025 11:19
@arghosh93 arghosh93 changed the title OCPBUGS-57447: Refrain from adding Egress IP to public LB backend pool OCPBUGS-57447,OCPBUGS-45056: Refrain from adding Egress IP to public LB backend pool Sep 26, 2025
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Sep 26, 2025
@openshift-ci-robot

@arghosh93: This pull request references Jira Issue OCPBUGS-57447, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-45056, which is invalid:

  • expected the bug to target either version "4.21." or "openshift-4.21.", but it targets "4.20.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@arghosh93
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 26, 2025
@openshift-ci-robot

@arghosh93: This pull request references Jira Issue OCPBUGS-57447, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826

This pull request references Jira Issue OCPBUGS-45056, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.


Member

@arkadeepsen arkadeepsen left a comment

Do we have any tests which can be used to verify the changes made in this PR will correctly solve the issue?

@arghosh93
Contributor Author

Do we have any tests which can be used to verify the changes made in this PR will correctly solve the issue?

We do not know the different cloud providers' APIs well enough to fake them. That is the main reason we do not have enough unit tests.
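One way to get some coverage without faking a cloud API is to isolate the decision logic in a pure function and test only that. The following is a minimal sketch under that assumption; `ipConfig` and `stripBackendPools` are hypothetical names introduced for illustration and are not part of this PR or of the Azure SDK.

```go
package main

import "fmt"

// ipConfig is a hypothetical, simplified stand-in for a NIC IP configuration.
// It is NOT the armnetwork type used by the real provider.
type ipConfig struct {
	PrivateIP    string
	BackendPools []string // IDs of the LB backend pools this IP belongs to
}

// stripBackendPools clears the backend pools of every configuration whose
// private IP equals egressIP and reports whether anything was changed.
func stripBackendPools(cfgs []*ipConfig, egressIP string) bool {
	modified := false
	for _, c := range cfgs {
		if c != nil && c.PrivateIP == egressIP && c.BackendPools != nil {
			c.BackendPools = nil
			modified = true
		}
	}
	return modified
}

func main() {
	cfgs := []*ipConfig{
		{PrivateIP: "10.0.0.4", BackendPools: []string{"pool-a"}},  // node primary IP
		{PrivateIP: "10.0.0.10", BackendPools: []string{"pool-a"}}, // egress IP
	}
	changed := stripBackendPools(cfgs, "10.0.0.10")
	// Only the egress IP loses its backend pool membership.
	fmt.Println(changed, len(cfgs[0].BackendPools), len(cfgs[1].BackendPools))
}
```

A table-driven unit test could then call the helper directly, leaving only the thin read/update wrapper around the cloud API untested.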

@pperiyasamy
Member

@arghosh93 Does this PR introduce any limitations on pod egress traffic? From my understanding, if we skip adding the EgressIP to the load balancer backend pools, the egress traffic will be restricted to the infra subnet. Is that correct?

@arghosh93
Contributor Author

@arghosh93 Does this PR introduce any limitations on pod egress traffic? From my understanding, if we skip adding the EgressIP to the load balancer backend pools, the egress traffic will be restricted to the infra subnet. Is that correct?

Yes @pperiyasamy , that is correct. The plan is to document this limitation along with a suggestion of using NAT gateway instead of a general purpose public load balancer. I am also gonna notify support team members so that everyone is well aware.

@pperiyasamy
Member

pperiyasamy commented Oct 16, 2025

@arghosh93 Does this PR introduce any limitations on pod egress traffic? From my understanding, if we skip adding the EgressIP to the load balancer backend pools, the egress traffic will be restricted to the infra subnet. Is that correct?

Yes @pperiyasamy , that is correct. The plan is to document this limitation along with a suggestion of using NAT gateway instead of a general purpose public load balancer. I am also gonna notify support team members so that everyone is well aware.

Thanks @arghosh93. If everyone agrees with this, I'm fine with it. One comment on the sync function.

// backend pool regardless of the presence of an OutBoundRule.
// During upgrade this function removes any egress IP added to
// public load balancer backend pool previously.
func (a *Azure) SyncLBBackend(ip net.IP, node *corev1.Node) error {
Member

Have you tested this SyncLBBackend function with many EIPs already in place? AFAIK the Azure APIs are slow, and I am not sure how this behaves when syncing IPs that already exist.
Have you explored syncing all the IPs that belong to a node with a single API call? Something similar to processing existing items (like this) before watching CloudPrivateIPConfig objects.

Contributor Author

I have tested with around 10 egress IPs and have not seen much delay. We can ask QE to test with more egress IPs. There is a Slack thread where the ARO team did some testing with this PR.
https://redhat-internal.slack.com/archives/C09G14XDR9B/p1759942816358369?thread_ts=1759848001.402299&cid=C09G14XDR9B
Egress IPs are queued separately, so it may be difficult to obtain all of them at once. This is also a one-time operation that is expected to run mostly during the upgrade.
I do not anticipate it taking much time or extending beyond the upgrade completion time.
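For reference, the per-node batching idea raised in this thread could look roughly like the sketch below. It only shows the grouping step; `assignment` and `groupByNode` are hypothetical names, and the sketch assumes the existing node/IP pairs can be listed up front, which the reply above notes may be difficult.

```go
package main

import (
	"fmt"
	"sort"
)

// assignment is a hypothetical pairing of an egress IP with the node hosting
// it, for example as it might be read from existing CloudPrivateIPConfig objects.
type assignment struct {
	Node string
	IP   string
}

// groupByNode collects egress IPs per node so that each node's network
// interface would need only one read and one update.
func groupByNode(items []assignment) map[string][]string {
	byNode := make(map[string][]string)
	for _, it := range items {
		byNode[it.Node] = append(byNode[it.Node], it.IP)
	}
	return byNode
}

func main() {
	existing := []assignment{
		{Node: "worker-0", IP: "10.0.0.10"},
		{Node: "worker-1", IP: "10.0.0.11"},
		{Node: "worker-0", IP: "10.0.0.12"},
	}
	byNode := groupByNode(existing)

	// Sort node names only to make the printed order deterministic.
	nodes := make([]string, 0, len(byNode))
	for n := range byNode {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes)

	for _, n := range nodes {
		// A real implementation would issue a single NIC update per node here.
		fmt.Printf("%s: one update covering %v\n", n, byNode[n])
	}
}
```

With this shape, the number of cloud API updates scales with the number of egress nodes rather than the number of egress IPs.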

@pperiyasamy
Copy link
Member

/retest-required

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2025
This PR stops adding egress IPs to the public load balancer
backend pool, regardless of the presence of an OutBoundRule, in any
Azure cluster.

The consequence of this change is no outbound connectivity
except to the infrastructure subnet, even if there is no OutBoundRule.

However, this is required to address the following problems:

- If an infra node is used as an egressNode, the health
check for the egress IP also succeeds once the IP is added to the public load
balancer, so the LB treats it as a legitimate ingress router backend.

- Adding egress IPs to the backend pool limits the number of egress IPs
that can be created on a cluster because of an Azure-specific limitation.

Signed-off-by: Arnab Ghosh <[email protected]>
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2025
The consensus is to not add egress IPs to the public load balancer
backend pool, regardless of the presence of an OutBoundRule.
During upgrade, this PR lets the controller remove any egress IP
previously added to the public load balancer backend pool.

Signed-off-by: Arnab Ghosh <[email protected]>
@arghosh93
Contributor Author

/retest-required

@arkadeepsen
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 31, 2025

@arghosh93: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-openstack-ovn-serial-e2e-only | c2b5065 | link | false | /test e2e-openstack-ovn-serial-e2e-only
ci/prow/e2e-aws-ovn-serial | c2b5065 | link | false | /test e2e-aws-ovn-serial
ci/prow/security | 4d24cb6 | link | false | /test security


@arghosh93
Contributor Author

/retest-required

@pperiyasamy
Member

/lgtm
/hold

@kyrtapz is it worth syncing existing CloudPrivateIPConfigs before handling CloudPrivateIPConfig events? It may reduce the number of Azure API calls. I will remove the /hold label based on your feedback. Thanks.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2025
@coderabbitai

coderabbitai bot commented Nov 4, 2025

Walkthrough

Introduces SyncLBBackend method to CloudProviderIntf interface with provider-specific implementations. Azure implements backend pool cleanup logic with lock acquisition and network interface updates. AWS, GCP, and OpenStack provide no-op implementations. The controller calls this method during CloudPrivateIPConfig synchronization when no add/delete operation occurs.

Changes

Cohort | File(s) | Summary
Interface definition | pkg/cloudprovider/cloudprovider.go | Added SyncLBBackend method to CloudProviderIntf interface with signature SyncLBBackend(ip net.IP, node *corev1.Node) error. Updated imports to include v1 cloudnetwork API.
No-op implementations | pkg/cloudprovider/aws.go, pkg/cloudprovider/gcp.go, pkg/cloudprovider/openstack.go, pkg/cloudprovider/cloudprovider_fake.go | Added no-op SyncLBBackend methods returning nil. AWS and GCP include comments that no Egress IP handling occurs; OpenStack indicates public LB backends are not modified.
Azure provider implementation | pkg/cloudprovider/azure.go | Removed backendAddressPoolClient field and initialization; added lbBackendPoolSynced flag. Implemented SyncLBBackend to acquire per-node lock, fetch network interface, strip LoadBalancerBackendAddressPools from IP configurations, update interface, and set sync flag. Reworked AssignPrivateIP to remove prior backend pool handling logic; adds warning when creating secondary IPs without backend pools. Removed helper methods related to backend pool lookup.
Controller integration | pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go | Modified SyncHandler NOOP branch to fetch the Node object and invoke cloudProviderClient.SyncLBBackend(ip, node), returning error if the call fails.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Azure SyncLBBackend implementation: Requires careful review of per-node lock semantics, network interface update flow, and interaction with Azure API clients (armnetwork). Verify that IP configuration stripping logic correctly targets only the specific IP's LoadBalancerBackendAddressPools.
  • AssignPrivateIP rework in Azure: Confirm that removal of prior backend pool carve-out logic doesn't break existing IP assignment paths and that the warning log is appropriate for secondary IP creation.
  • Controller integration: Ensure NOOP path change (now performs SyncLBBackend instead of returning early) doesn't introduce unintended side effects or performance regressions in the reconciliation loop.
  • Interface consistency: Verify all provider implementations satisfy the interface contract and that no-op implementations are intentional for AWS, GCP, and OpenStack.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions



@openshift-ci
Contributor

openshift-ci bot commented Nov 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arghosh93, arkadeepsen, pperiyasamy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8384756 and 4d24cb6.

📒 Files selected for processing (7)
  • pkg/cloudprovider/aws.go (1 hunks)
  • pkg/cloudprovider/azure.go (6 hunks)
  • pkg/cloudprovider/cloudprovider.go (2 hunks)
  • pkg/cloudprovider/cloudprovider_fake.go (1 hunks)
  • pkg/cloudprovider/gcp.go (1 hunks)
  • pkg/cloudprovider/openstack.go (1 hunks)
  • pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go (1 hunks)

Comment on lines +310 to +355
if a.lbBackendPoolSynced {
	// nothing to do. Return immediately if LB backend has already synced
	return nil
}
ipc := ip.String()
klog.Infof("Acquiring node lock for modifying load balancer backend pool, node: %s, ip: %s", node.Name, ipc)
nodeLock := a.getNodeLock(node.Name)
nodeLock.Lock()
defer nodeLock.Unlock()
instance, err := a.getInstance(node)
if err != nil {
	return err
}
networkInterfaces, err := a.getNetworkInterfaces(instance)
if err != nil {
	return err
}
if networkInterfaces[0].Properties == nil {
	return fmt.Errorf("nil network interface properties")
}
// Perform the operation against the first interface listed, which will be
// the primary interface (if it's defined as such) or the first one returned
// following the order Azure specifies.
networkInterface := networkInterfaces[0]
var loadBalanceerBackendPoolModified bool
// omit Egress IP from LB backend pool
ipConfigurations := networkInterface.Properties.IPConfigurations
for _, ipCfg := range ipConfigurations {
	if ptr.Deref(ipCfg.Properties.PrivateIPAddress, "") == ipc &&
		ipCfg.Properties.LoadBalancerBackendAddressPools != nil {
		ipCfg.Properties.LoadBalancerBackendAddressPools = nil
		loadBalanceerBackendPoolModified = true
	}
}
if loadBalanceerBackendPoolModified {
	networkInterface.Properties.IPConfigurations = ipConfigurations
	poller, err := a.createOrUpdate(networkInterface)
	if err != nil {
		return err
	}
	if err = a.waitForCompletion(poller); err != nil {
		return err
	}
	a.lbBackendPoolSynced = true
	return nil
}

⚠️ Potential issue | 🔴 Critical

Do not short-circuit backend cleanup after the first IP.

lbBackendPoolSynced is a single flag on the provider. After the very first IP update sets it to true, every later call returns immediately, so we never remove backend pool references for the rest of the cluster’s Egress IPs. That is a functional regression for any cluster with more than one IP (or any IP reconciled after the first). We need to track sync state per IP (or simply drop the early return and rely on idempotent updates) so that each IP can be cleaned up.

A minimal fix is to replace the boolean with per-IP bookkeeping:

 type Azure struct {
 	CloudProvider
 	platformStatus               *configv1.AzurePlatformStatus
 	resourceGroup                string
 	env                          azureapi.Environment
 	vmClient                     *armcompute.VirtualMachinesClient
 	virtualNetworkClient         *armnetwork.VirtualNetworksClient
 	networkClient                *armnetwork.InterfacesClient
 	nodeMapLock                  sync.Mutex
 	nodeLockMap                  map[string]*sync.Mutex
 	azureWorkloadIdentityEnabled bool
-	lbBackendPoolSynced          bool
+	lbBackendPoolSynced          map[string]bool
 }

Initialize the map where we build the Azure struct, and update the method to key off node.Name/ip.String():

-	if a.lbBackendPoolSynced {
-		return nil
-	}
+	if a.lbBackendPoolSynced == nil {
+		a.lbBackendPoolSynced = make(map[string]bool)
+	}
+	cacheKey := fmt.Sprintf("%s|%s", node.Name, ipc)
+	if a.lbBackendPoolSynced[cacheKey] {
+		return nil
+	}

-		a.lbBackendPoolSynced = true
+		a.lbBackendPoolSynced[cacheKey] = true

Make sure to import "fmt" if it’s not already present. That preserves the “run once” optimization per IP while still cleaning every IP.
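A self-contained sketch of the per-IP bookkeeping is shown below. It makes one assumption beyond the suggestion: that calls for different nodes may run concurrently (the lock in the reviewed snippet is per node), in which case a plain Go map shared across nodes would need its own mutex, since concurrent map writes are not safe. `syncTracker` and its methods are hypothetical names, not part of the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// syncTracker records which node/IP pairs have already been cleaned up.
// The mutex guards the map independently of any per-node lock.
type syncTracker struct {
	mu   sync.Mutex
	done map[string]bool
}

func newSyncTracker() *syncTracker {
	return &syncTracker{done: make(map[string]bool)}
}

// key builds the cache key from the node name and the IP string.
func key(node, ip string) string { return node + "|" + ip }

// isDone reports whether the pair was already marked as synced.
func (t *syncTracker) isDone(node, ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.done[key(node, ip)]
}

// markDone records a successful sync for the pair.
func (t *syncTracker) markDone(node, ip string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.done[key(node, ip)] = true
}

func main() {
	t := newSyncTracker()
	var wg sync.WaitGroup
	// Simulate syncs for several nodes finishing concurrently.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			t.markDone(fmt.Sprintf("worker-%d", i), "10.0.0.10")
		}(i)
	}
	wg.Wait()
	// The first pair was synced; a second IP on the same node was not.
	fmt.Println(t.isDone("worker-0", "10.0.0.10"), t.isDone("worker-0", "10.0.0.11"))
}
```

Dropping the early return entirely, as the comment also mentions, avoids this state altogether at the cost of one interface read per reconciled IP.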

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In pkg/cloudprovider/azure.go around lines 310-355 the code short-circuits
cleanup using a single boolean a.lbBackendPoolSynced causing only the first IP
to ever be cleaned; replace this single flag with per-IP bookkeeping (e.g.
map[string]bool keyed by node.Name+"/"+ip.String()) initialized where the Azure
struct is built, change the early-return check to look up the node/ip key, and
after a successful update set the map entry to true for that key; ensure the map
is properly created on struct initialization and import fmt if not already
present.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.


5 participants