NO-JIRA: Branch Sync release-4.19 to release-4.18 [11-14-2025] #2861
base: release-4.18
Conversation
When processing pods during an EgressIP status update, the controller used to stop iterating as soon as it encountered a pod in Pending state (in my case, pod IPs are not found when the pod is in the Pending phase with a ContainerCreating status). This caused any subsequent Running pods to be skipped, leaving their SNAT entries unprogrammed on the egress node. With this change, only Pending pods are skipped, while iteration continues for the rest. This ensures that Running pods are properly processed and their SNAT entries are programmed. This change also skips pods that are unscheduled or use host networking. Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit 2afbaf6)
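For readers skimming the backport, here is a minimal Go sketch of the iteration change this commit describes. The names processPodsForStatusUpdate and programPodSNAT are made up for illustration, not the actual ovn-kubernetes functions; the sketch only shows skipping unprocessable pods with continue instead of aborting the loop.

```go
package egressip

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// programPodSNAT is a hypothetical stand-in for programming the SNAT entry
// of a single pod on the egress node.
func programPodSNAT(pod *corev1.Pod) error {
	fmt.Printf("programming SNAT for pod %s/%s\n", pod.Namespace, pod.Name)
	return nil
}

// processPodsForStatusUpdate illustrates the fix: pods that cannot be
// processed yet (Pending, unscheduled, or host-networked) are skipped with
// continue, so later Running pods still get their SNAT entries programmed.
func processPodsForStatusUpdate(pods []*corev1.Pod) error {
	for _, pod := range pods {
		// Pending pods (e.g. ContainerCreating) have no pod IPs yet; skip
		// them instead of aborting the whole loop.
		if pod.Status.Phase == corev1.PodPending {
			continue
		}
		// Unscheduled pods have no node to program SNAT on.
		if pod.Spec.NodeName == "" {
			continue
		}
		// Host-networked pods do not get per-pod SNAT entries.
		if pod.Spec.HostNetwork {
			continue
		}
		if err := programPodSNAT(pod); err != nil {
			return err
		}
	}
	return nil
}
```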
…2831-to-release-4.19 [release-4.19] OCPBUGS-63660: Skip Pending pods in EgressIP status updates
When support for multiple networks was first added, all controllers that were added used the label "Secondary" to indicate they were not "Default". When UDN was added, it allowed "Secondary" networks to function as the primary network for a pod, creating terminology confusion. We now treat all non-default networks as "User-Defined Networks". This commit changes all naming to conform to the latter. The only place "secondary" is still used is to distinguish whether a UDN is acting as a primary or secondary network for a pod (its role). The only exception to this is udn-isolation. I did not touch this because it relies on dbIDs, which would impact functionality for upgrade. There is no functional change in this commit. Signed-off-by: Tim Rozet <[email protected]> (cherry picked from commit bbca874)
The k8s e2e utility functions AddOrUpdateLabelOnNode/RemoveLabelOffNode don't work for labels without a value. The incorrect handling of these labels caused the nodes to be migrated in a different sequence than the tests intended to exercise. Signed-off-by: Jaime Caamaño Ruiz <[email protected]> (cherry picked from commit 434b48f)
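As a hedged aside on what "labels without a value" means here, this is a sketch of how a test could set and clear such a label with plain client-go patches, independent of the e2e utilities mentioned above. The helper names are invented and this is not the framework code.

```go
package nodetest

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// addValuelessLabel sets a label whose value is the empty string, e.g.
// "k8s.ovn.org/egress-assignable": "".
func addValuelessLabel(ctx context.Context, c kubernetes.Interface, node, key string) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{%q:""}}}`, key))
	_, err := c.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

// removeLabel deletes the label entirely by patching it to null.
func removeLabel(ctx context.Context, c kubernetes.Interface, node, key string) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{%q:null}}}`, key))
	_, err := c.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```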
There are two circumstances in which IPs were being handled incorrectly:
* when a live migratable pod completed with no migration ongoing, its IPs were not being released, because IsMigratedSourcePodStale outright assumed a completed pod was stale;
* when a live migratable pod completed on a different node than the VM's original node as part of a migration, its IPs were being released when they shouldn't have been, because we were simply not checking whether it was a migration.
It also improves the tests to check for IP release. Signed-off-by: Jaime Caamaño Ruiz <[email protected]> (cherry picked from commit 4c34982)
Don't attempt to release IPs that are not managed by the local zone, which can happen with live migratable pods; otherwise we would get distracting error logs on release. Signed-off-by: Jaime Caamaño Ruiz <[email protected]> (cherry picked from commit 7a155cc)
ConditionalIPRelease would always return false when checking IPs not tracked in the local zone, so in that case we were not correctly checking for colliding pods. This was hidden by the fact that, until a very recent fix, IsMigratedSourcePodStale was used just before instead of AllVMPodsAreCompleted, and that would always return false for a completed live migratable pod. Signed-off-by: Jaime Caamaño Ruiz <[email protected]> (cherry picked from commit 0dc8f27)
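A much-simplified sketch of the release decision the three commits above converge on. vmState, shouldReleaseIPs and their fields are illustrative placeholders under the assumption that migration state, zone ownership and IP collisions are computed elsewhere; this is not the real ovn-kubernetes logic.

```go
package ipalloc

import corev1 "k8s.io/api/core/v1"

// vmState is a simplified view of the inputs the release decision needs.
type vmState struct {
	migrationInProgress bool // another pod of the same VM is still being migrated to
	managedByLocalZone  bool // this zone's allocator tracks the pod's IPs
	collidingPodExists  bool // some other running pod already owns these IPs
}

func podCompleted(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}

// shouldReleaseIPs illustrates the corrected decision:
//   - only completed pods are candidates,
//   - a completed pod of a VM with a migration still in progress keeps its IPs,
//   - IPs are only released through this zone's allocator when it manages them,
//   - and release is skipped if another pod already owns (collides with) the IPs.
func shouldReleaseIPs(pod *corev1.Pod, s vmState) bool {
	if !podCompleted(pod) {
		return false
	}
	if s.migrationInProgress {
		return false
	}
	if !s.managedByLocalZone {
		return false
	}
	return !s.collidingPodExists
}
```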
Or completion of a failed target pod Signed-off-by: Jaime Caamaño Ruiz <[email protected]> (cherry picked from commit c1b02b5)
As it is the most complex scenario and a superset of testing without it Signed-off-by: Jaime Caamaño Ruiz <[email protected]> (cherry picked from commit ef92f78)
I accidentally removed the check in a recent PR [1], which could have performance consequences as checking against other pods has a cost. Reintroduce the check with a hopefully useful comment to prevent it from happening again. [1] ovn-kubernetes/ovn-kubernetes#5626 Signed-off-by: Jaime Caamaño Ruiz <[email protected]> (cherry picked from commit 76f6439)
…2801-to-release-4.19 [release-4.19] OCPBUGS-64645: kubevirt: fix bad release of IPs of live migratable pods
…odes Previously, we were checking whether the next hop IP was valid against the current set of all nodes, but every EgressIP is assigned to only a subset of those nodes. Stale LRPs could occur if a node hosted EgressIP-served pods, its ovnkube-controller was down, and the EIP moved to a new node while said controller was down. Signed-off-by: Martin Kennelly <[email protected]> Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit d2b7cbe) (cherry picked from commit 688885f)
For IC mode, there is no expectation that we can fetch a remote node's LSP; therefore, skipping it (continue) causes us to also skip generating valid next hops for the remote node. Later, when syncing LRPs, a valid next hop is inspected, not found among the generated next hops, and removed. Handlers will re-add it shortly afterwards. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit a46e0e7) (cherry picked from commit 6f6edf3)
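To make the next-hop reasoning in the last two commits concrete, here is a toy Go sketch. It assumes each node contributes either a local management-port IP or, for remote interconnect-zone nodes, a transit IP, and that an EgressIP status lists its assigned nodes; all type and field names are invented and this is not the actual controller code.

```go
package egressip

// node is a simplified node view for this sketch.
type node struct {
	name       string
	isRemote   bool   // node lives in a remote zone in interconnect (IC) mode
	mgmtPortIP string // next hop for local nodes
	transitIP  string // next hop for remote nodes in IC mode
}

// eipStatus lists the nodes an EgressIP is currently assigned to.
type eipStatus struct {
	assignedNodes map[string]bool
}

// validNextHops returns the set of next-hop IPs that LRPs for this EgressIP
// may legitimately use. Two points from the fixes above:
//   - only nodes the EgressIP is actually assigned to contribute next hops
//     (not the full node set), and
//   - remote nodes are not skipped just because their LSP cannot be fetched
//     locally; in IC mode their transit IP is used instead.
func validNextHops(nodes []node, status eipStatus) map[string]bool {
	hops := map[string]bool{}
	for _, n := range nodes {
		if !status.assignedNodes[n.name] {
			continue // EgressIP is not hosted here; not a valid next hop
		}
		if n.isRemote {
			if n.transitIP != "" {
				hops[n.transitIP] = true
			}
			continue
		}
		if n.mgmtPortIP != "" {
			hops[n.mgmtPortIP] = true
		}
	}
	return hops
}

// Stale LRPs are then those whose next hop is not in the returned set.
```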
Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit 2735d6b) (cherry picked from commit 562c749)
Prior to this change, we did not emit an error log for stale next hops. Signed-off-by: Martin Kennelly <[email protected]> (cherry picked from commit ab082bd) (cherry picked from commit c337b16)
Scenario:
- Nodes: node-1, node-2, node-3
- Egress IPs: EIP-1
- Pods: pod1 on node-1, pod2 on node-3 (pods are created via deployment replicas)
- Egress-assignable nodes: node-1, node-2
- EIP-1 assigned to node-1

During a simultaneous reboot of node-1 and node-2, EIP-1 failed over to node-2 and ovnkube-controller restarted at nearly the same time:
1) EIP-1 was reassigned to node-2 by the cluster manager.
2) The EgressIP sync ran for EIP-1 with stale status, though it cleaned up the SNATs/LRPs referring to node-1 because the pod IPs were outdated (the pods were recreated due to the node reboots).
3) pod1/pod2 Add events arrived while the informer cache still had the old EIP status, so new SNATs/LRPs were created pointing to node-1.
4) The EIP-1 Add event arrived with the new status; entries for node-2 were added/updated.
5) Result: stale SNATs and LRPs with stale next hops for node-1 remained.

Fix:
- Populate pod EIP status during EgressIP sync so podAssignment has accurate egressStatuses.
- Reconcile stale assignments using podAssignment (egressStatuses) when the informer cache is not up to date, ensuring the SNAT/LRP for the previously assigned node are corrected.
- Remove stale EIP SNAT entries for remote-zone pods accordingly.
- Add coverage for simultaneous EIP failover and controller restart.

Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit 1667a51) (cherry picked from commit 7060af6)
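A hedged, heavily simplified sketch of the reconciliation idea in the fix: keep the egress statuses a pod was actually programmed with in a local cache, and clean up against that cache rather than the possibly stale informer view. The types and callbacks are illustrative only, not the real podAssignment implementation.

```go
package egressip

// assignment records which egress node a pod's SNAT/LRP currently point to.
type assignment struct {
	egressNode string
	egressIP   string
}

// podAssignmentCache maps pod key -> the statuses actually programmed in OVN.
type podAssignmentCache map[string][]assignment

// reconcilePod removes SNAT/LRP programming that refers to nodes no longer in
// the desired status, then records the desired status in the cache. The
// deleteSNATAndLRP and addSNATAndLRP callbacks stand in for the real OVN ops.
func reconcilePod(
	cache podAssignmentCache,
	podKey string,
	desired []assignment,
	deleteSNATAndLRP func(assignment) error,
	addSNATAndLRP func(assignment) error,
) error {
	want := map[assignment]bool{}
	for _, a := range desired {
		want[a] = true
	}
	// Remove anything we previously programmed that is no longer desired,
	// using our own cache rather than the informer's (possibly stale) view.
	for _, a := range cache[podKey] {
		if !want[a] {
			if err := deleteSNATAndLRP(a); err != nil {
				return err
			}
		}
	}
	// Program (or re-program) the desired entries.
	for _, a := range desired {
		if err := addSNATAndLRP(a); err != nil {
			return err
		}
	}
	cache[podKey] = desired
	return nil
}
```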
During an ovnkube-controller restart, pod add/remove events for EgressIP-served pods may occur before the factory.egressIPPod handler is registered in the watch factory. As a result, the EIP controller is never able to handle the pod delete, leaving a stale logical router policy (LRP) entry.

Scenario: ovnkube-controller starts. The EIP controller processes the namespace add event (oc.WatchEgressIPNamespaces) and creates an LRP entry for the served pod. The pod is deleted. The factory.egressIPPod handler registration happens afterward via oc.WatchEgressIPPods. The pod delete event is never processed by the EIP controller.

Fix:
1. Start oc.WatchEgressIPPods followed by oc.WatchEgressIPNamespaces.
2. Sync EgressIPs before registering the factory.egressIPPod event handler.
3. Remove the EgressIP sync for factory.EgressIPNamespaceType, which is no longer needed.

Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit 8975b00) (cherry picked from commit b8303a2)
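A small sketch of the ordering described in the fix, with an invented watcher interface standing in for the controller. It only encodes "sync first, then pod handler, then namespace handler" and is not the actual startup code.

```go
package egressip

// watcher is an invented stand-in for the pieces of controller startup that
// matter for ordering in this fix.
type watcher interface {
	SyncEgressIPs() error           // reconcile OVN against current EgressIP state
	WatchEgressIPPods() error       // register the pod event handler
	WatchEgressIPNamespaces() error // register the namespace event handler
}

// startEgressIPWatchers encodes the fixed ordering:
//  1. sync EgressIPs before any handler is registered, so stale LRPs/SNATs
//     left over from pods deleted during downtime are cleaned up,
//  2. register the pod handler before the namespace handler, so a pod delete
//     that races with namespace processing is not lost.
func startEgressIPWatchers(w watcher) error {
	if err := w.SyncEgressIPs(); err != nil {
		return err
	}
	if err := w.WatchEgressIPPods(); err != nil {
		return err
	}
	return w.WatchEgressIPNamespaces()
}
```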
When the EIP controller cleans up a stale EIP assignment for a pod, it also removes the pod object from the podAssignment cache. This is incorrect, as it prevents the EIP controller from processing the subsequent pod delete event.

Scenario:
1. pod-1 is served by eip-1, both hosted on node1.
2. node1's ovnkube-controller restarts.
3. The pod add event is received by the EIP controller; no changes.
4. eip-1 moves from node1 to node0.
5. The EIP controller receives the eip-1 add event.
6. eip-1 cleans up pod-1's stale assignment (SNAT and LRP) for node1, but removes the pod object from the podAssignment cache when no other assignments are found.
7. The EIP controller programs the LRP entry with node0's transit IP as the next hop, but the pod assignment cache is not updated with the new podAssignmentState.
8. The pod delete event is received by the EIP controller but ignored, since the pod object is missing from the assignment cache.

This commit fixes the issue by adding the podAssignmentState back into the podAssignment cache at step 7.

Signed-off-by: Periyasamy Palanisamy <[email protected]> (cherry picked from commit 16dedd1) (cherry picked from commit f4e2c17)
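A minimal sketch of the step-7 correction, assuming a simple map-based podAssignment cache. Names and fields are hypothetical; the point is only that the pod's entry is written back after cleanup so the later delete event still finds it.

```go
package egressip

// podAssignmentState tracks what is programmed in OVN for one pod.
type podAssignmentState struct {
	egressStatuses map[string]string // egress node -> egress IP
}

// cleanupStaleAndReprogram drops the pod's entry for staleNode, records the
// new node, and, crucially, keeps the pod in the cache either way so that the
// eventual pod delete event still finds it and can remove the LRP/SNAT.
func cleanupStaleAndReprogram(
	cache map[string]*podAssignmentState,
	podKey, staleNode, newNode, egressIP string,
) {
	state, ok := cache[podKey]
	if !ok {
		state = &podAssignmentState{egressStatuses: map[string]string{}}
	}
	delete(state.egressStatuses, staleNode) // stale SNAT/LRP cleaned up elsewhere
	state.egressStatuses[newNode] = egressIP

	// Before the fix the entry could be dropped here when no assignments were
	// left after cleanup; now the (updated) state is always written back.
	cache[podKey] = state
}
```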
…start_4.19 [release-4.19] OCPBUGS-64854: Fix stale EIP assignments during failover and controller restart
…4.19-to-release-4.18-11-14-2025
/ok-to-test
@openshift-pr-manager[bot]: This pull request explicitly references no jira issue.
@openshift-pr-manager[bot]: trigger 4 job(s) of type blocking for the ci release of OCP 4.18
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/563c7850-c1aa-11f0-87f3-38cfbf37e86e-0
trigger 10 job(s) of type blocking for the nightly release of OCP 4.18
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/563c7850-c1aa-11f0-87f3-38cfbf37e86e-1
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: openshift-pr-manager[bot]
/test e2e-aws-ovn-windows
/payload-job periodic-ci-openshift-release-master-ci-4.18-e2e-aws-upgrade-ovn-single-node
@jluhrsen: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bf9b8710-c2b7-11f0-93a0-08da60c4d8f3-0
/test e2e-aws-ovn-windows
/payload-job periodic-ci-openshift-release-master-ci-4.18-e2e-aws-upgrade-ovn-single-node
@jluhrsen: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f8c1f1e0-c308-11f0-8359-5a05bab404da-0
@jluhrsen: This PR has been marked as verified.
Automated branch sync: release-4.19 to release-4.18.