[DNM] Troubleshoot e2e-aws-ovn-windows CI failure #2761

pperiyasamy · 2025-09-18T12:31:31Z

No description provided.

sometimes a tmp pv image file used for setting up the runner gets leftover in /mnt. Add the removal of that in the "Free up disk space" step. Also move that step to a composite action to deduplicate it for easier maintenance Also added the Runner Diagnostics step as the first step that runs for every job Signed-off-by: Jamo Luhrsen <[email protected]>

When multiple networks support was first added, all controllers that were added used the label "Secondary" to indicate they were not "Default". When UDN was added, it allowed "Secondary" networks to function as the primary network for a pod, creating terminology confusion. We now treat non-default networks all as "User-Defined Networks". This commit changes all naming to conform to the latter. The only places secondary is used now is for distinguishing whether or not a UDN is acting as a primary or secondary network for a pod (it's role). The only exception to this is udn-isolation. I did not touch this because it relies on dbIDs, which would impact functionality for upgrade. There is no functional change in this commit. Signed-off-by: Tim Rozet <[email protected]>

Rabbit found these during code review. Signed-off-by: Tim Rozet <[email protected]>

Moves ACL DBIDs from "Secondary" to "PrimaryUDN". Signed-off-by: Tim Rozet <[email protected]>

Signed-off-by: Tim Rozet <[email protected]>

openshift-ci · 2025-09-18T13:04:35Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pperiyasamy
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Fixes regression from 1448d5a The previous commit dropped matching on in_port so that localnet ports would also use table 1. This allows reply packets from a localnet pod towards the shared OVN/LOCAL IP to be sent to the correct port. However, a regression was introduced where traffic coming from these localnet ports to any destination would be sent to table 1. Egress traffic from the localnet ports is not committed to conntrack, so by sending to table=1 via CT we were getting a miss. This is especially bad for hardware offload where a localnet port is being used as the Geneve encap port. In this case all geneve traffic misses in CT lookup and is not offloaded. Table 1 is intended to be for handling IP traffic destined to the shared Gateway IP/MAC that both the Host and OVN use. It is also used to handle reply traffic for Egress IP. To fix this problem, we can add dl_dst match criteria to this flow, ensuring that only traffic destined to the Host/OVN goes to table 1. Furthermore, after fixing this problem there still exists the issue that localnet -> host/OVN egress traffic will still enter table 1 and CT miss. Potentially this can be fixed with always committing egress traffic, but it might have performance penalty, so deferring that fix to a later date. Signed-off-by: Tim Rozet <[email protected]>

We did this for IPv4 in 1448d5a, but forgot about IPv6. Signed-off-by: Riccardo Ravaioli <[email protected]>

Add dl_dst=$breth0 to table=0, prio=50 for IPv6 We want to match in table=1 only conntrack'ed reply traffic whose next hop is either OVN or the host. As a consequence, localnet traffic whose next hop is an external router (and that might or might not be destined to OVN/host) should bypass table=1 and just hit the NORMAL flow in table=0. Signed-off-by: Riccardo Ravaioli <[email protected]>

Signed-off-by: Riccardo Ravaioli <[email protected]>

We already tested localnet -> host, let's also cover connections initiated from the host. The localnet uses IPs in the same subnet as the host network. Signed-off-by: Riccardo Ravaioli <[email protected]>

We have two non-InterConnect CI lanes for multihoming, while only one with IC enabled (and local gw). We need coverage with IC enabled for both gateway modes, so let's make an existing non-IC lane IC enabled, set it as dualstack and gateway=shared to have better coverage. Signed-off-by: Riccardo Ravaioli <[email protected]>

Signed-off-by: Riccardo Ravaioli <[email protected]>

This is needed because we will need to generate IPs from different subnets than just the host subnet. Signed-off-by: Riccardo Ravaioli <[email protected]>

Signed-off-by: Riccardo Ravaioli <[email protected]>

The localnet is on a subnet different than the host subnet, the corresponding NAD is configured with a VLAN ID, the localnet pod uses an external router to communicate to cluster pods. Signed-off-by: Riccardo Ravaioli <[email protected]>

Signed-off-by: Riccardo Ravaioli <[email protected]>

In testing we saw how an invalid conntrack state would drop all echo requests after the first one. Let's send three pings in each test then. Signed-off-by: Riccardo Ravaioli <[email protected]>

Currently, we are force exiting with the trap before the background processes can end, container is removed and the orphaned processes end early causing our config to go into an unknown state because we dont end in an orderly manner. Wait until the pid file for ovnkube controller with node is removed which shows the process has completed. Signed-off-by: Martin Kennelly <[email protected]>

Prevent ovn-controller from sending stale GARP by adding drop flows on external bridge patch ports until ovnkube-controller synchronizes the southbound database - henceforth known as "drop flows". This addresses race conditions where ovn-controller processes outdated SB DB state before ovnkube-controller updates it, particularly affecting EIP SNAT configurations attached to logical router ports. Fixes: https://issues.redhat.com/browse/FDP-1537 ovnkube-controller controls the lifecycle of the drop flows. ovs / ovn-controller running is required to configure external bridge. Downstream, the external bridge maybe precreated and ovn-controller will use this. This fix considers three primary scenarios: node, container and pod restart. On Node restart means the ovs flows installed priotior to reboot on the node are cleared but the external bridge exists. Add the flows before ovnkube controller with node starts. The reason to add it here is that our gateway code depends on ovn-controller started and running... There is now a race here between ovn-controller starting (and garping) before we set this flow but I think the risk is low however it needs serious testing. The reason I did not naturally at the drop flows before ovn-controller started is because I have no way to detect if its a node reboot or pod reboot and i dont want to inject drop flows for simple ovn-controller container restart which could disrupt traffic. ovnkube-controller starts, we create a new gateway and apply flows the same flows in-order to ensure we always drop GARP when ovnkube controller hasn't sync. Remove the flows when ovnkube-controller has syncd. There is also a race here between ovnkube-controller removing the flows and ovn-controller GARPing with stale SB DB info. There is no easy way to detect what SB DB data ovn-controller has consumed. On Pod restart, we add the drop flows before exit. ovnkube-controller-with-node will also add it before it starts the go code. Container restart: - ovnkube-controller: adds flows upon start and exit - ovn-controller: no changes While the drop flows are set, OVN may not be able to resolve IPs it doesn't know about in its Logical Router pipelines generation. Following removal of the drop flows, OVN may resolve the IPs using GARP requests. OVN-Controller always sends out GARPs with op code 1 on startup. Signed-off-by: Martin Kennelly <[email protected]>

Use the golang image from quay.io/projectquay/golang due to issues pulling the image at quay.io/lib/golang Signed-off-by: Killian Muldoon <[email protected]>

Signed-off-by: Nadia Pinaeva <[email protected]>

Signed-off-by: Ram Lavi <[email protected]>

This PR is to add subnet overlap check with transit switch subnet for any UDN. Signed-off-by: Arnab Ghosh <[email protected]>

Lets fetch agnhost image from upstream and pin it to the k8s release we consume. Downstream, we approve k8 images. It eases adoption of E2Es. Signed-off-by: Martin Kennelly <[email protected]>

Required for downstream hack to detect deployment config. Signed-off-by: Martin Kennelly <[email protected]>

'name' isn't used downstream where as 'app' is. Also, some simple formatting issues on those files. There are other deployments that use 'name' but we don't use them downstream so I am not changing. Real fix is to use deployment config to expose what pods have what labels and then downstream we can write a deployment config override, however its complex idea and we use a different deployment model downtream that what we do upstream. There are many options upstream that we support and I dont care to harmonise them. Signed-off-by: Martin Kennelly <[email protected]>

Some new tests were added without a feature label. Signed-off-by: Martin Kennelly <[email protected]>

Since the last time I was here, here tests have been introduced with node selectors targetting KinD deployment. Remove that. There is another test here that still is tied to KinD deployment introduced recently but it needs a bigger refactor so im ignoring it for now. Signed-off-by: Martin Kennelly <[email protected]>

It is wrong to panic here. We should return error if we cannot find a deployment config for the target cluster. Signed-off-by: Martin Kennelly <[email protected]>

10.128.x.x is used intensively as providers network CIDR. 172.16.0.0/12 (rfc1918) is a private range network. Consume the upper bound range (172.31.x.x) of this private subnet to avoid potential collisions. Real fix is we provide tools for the tests to get CIDRs for UDNs that dont conflict with CDN, provider network. Signed-off-by: Martin Kennelly <[email protected]>

The EgressIP controller can deadlock when handling concurrent pod/EgressIP events that require locking multiple nodes in different orders. This occurs when: 1. Multiple EgressIP objects span different nodes. 2. Pods scheduled on non-EgressIP nodes match EgressIP labels. 3. Concurrent events lock EgressIP names first, then nodes in inconsistent order. Example deadlock scenario: - egressip-1 (node-1) matches app-2 (scheduled on node-2) - egressip-2 (node-2) matches app-1 (scheduled on node-1) - Concurrent processing locks nodes in different orders This blocks all event processing and causes pod admission failures. Solution: Lock nodes in lexicographical order to ensure consistent lock ordering and prevent deadlocks. This also includes unit test to verify the deadlock scenario and validate the fix. Signed-off-by: Periyasamy Palanisamy <[email protected]>

PR 5373 to drop the GARP flows didnt consider that we set the default network controller and later we set the gateway obj. In-between this period, ovnkube node may receive a stop signal and we do not guard against accessing the gateway if its not yet set. OVNKube controller may have sync'd before the gateway obj is set. There is nothing to reconcile if the gateway is not set. Signed-off-by: Martin Kennelly <[email protected]>

Signed-off-by: Enrique Llorente <[email protected]>

Add choosing tunnel key options. Signed-off-by: Nadia Pinaeva <[email protected]>

Fix other review comments that were not caught before. Change transit subnet from /30 to /31 Signed-off-by: Nadia Pinaeva <[email protected]>

Signed-off-by: Patryk Diak <[email protected]>

openshift-merge-robot · 2025-09-24T08:12:08Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2025-10-09T17:13:33Z

@pperiyasamy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-metal-ipi-ovn-dualstack	`40b0078`	link	true	`/test e2e-metal-ipi-ovn-dualstack`
ci/prow/qe-perfscale-aws-ovn-small-udn-density-churn-l3	`40b0078`	link	false	`/test qe-perfscale-aws-ovn-small-udn-density-churn-l3`
ci/prow/e2e-aws-ovn-hypershift-kubevirt	`40b0078`	link	false	`/test e2e-aws-ovn-hypershift-kubevirt`
ci/prow/e2e-gcp-ovn-techpreview	`40b0078`	link	true	`/test e2e-gcp-ovn-techpreview`
ci/prow/4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade	`40b0078`	link	true	`/test 4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade`
ci/prow/e2e-aws-ovn-upgrade-local-gateway	`40b0078`	link	true	`/test e2e-aws-ovn-upgrade-local-gateway`
ci/prow/e2e-openstack-ovn	`40b0078`	link	false	`/test e2e-openstack-ovn`
ci/prow/qe-perfscale-aws-ovn-small-udn-density-l3	`40b0078`	link	false	`/test qe-perfscale-aws-ovn-small-udn-density-l3`
ci/prow/security	`40b0078`	link	false	`/test security`
ci/prow/e2e-aws-ovn-hypershift-conformance-techpreview	`40b0078`	link	false	`/test e2e-aws-ovn-hypershift-conformance-techpreview`
ci/prow/okd-scos-e2e-aws-ovn	`40b0078`	link	false	`/test okd-scos-e2e-aws-ovn`
ci/prow/e2e-gcp-ovn	`40b0078`	link	true	`/test e2e-gcp-ovn`
ci/prow/lint	`40b0078`	link	true	`/test lint`
ci/prow/e2e-aws-ovn	`40b0078`	link	true	`/test e2e-aws-ovn`
ci/prow/4.21-upgrade-from-stable-4.20-images	`40b0078`	link	true	`/test 4.21-upgrade-from-stable-4.20-images`
ci/prow/4.21-upgrade-from-stable-4.20-e2e-aws-ovn-upgrade	`40b0078`	link	true	`/test 4.21-upgrade-from-stable-4.20-e2e-aws-ovn-upgrade`
ci/prow/4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade	`40b0078`	link	true	`/test 4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

jluhrsen and others added 5 commits September 18, 2025 14:29

A couple of minor fixes

3177db5

Rabbit found these during code review. Signed-off-by: Tim Rozet <[email protected]>

Updates UDN Isolation DBIDs

3003335

Moves ACL DBIDs from "Secondary" to "PrimaryUDN". Signed-off-by: Tim Rozet <[email protected]>

Update docs for UDN

5f77692

Signed-off-by: Tim Rozet <[email protected]>

pperiyasamy force-pushed the windows-ci-debug branch 3 times, most recently from 16ce7bd to cfa215b Compare September 19, 2025 07:01

trozet and others added 21 commits September 19, 2025 11:32

Openflow: drop in_port from IPv6 dispatch OF rule at prio=50

d2dd91a

We did this for IPv4 in 1448d5a, but forgot about IPv6. Signed-off-by: Riccardo Ravaioli <[email protected]>

E2E localnet: remove double import of ginkgo

8673a5b

Signed-off-by: Riccardo Ravaioli <[email protected]>

E2E: add test host -> localnet with IP in host subnet

7740348

We already tested localnet -> host, let's also cover connections initiated from the host. The localnet uses IPs in the same subnet as the host network. Signed-off-by: Riccardo Ravaioli <[email protected]>

E2E localnet: remove references to downstream bugs and stories

7b6d209

Signed-off-by: Riccardo Ravaioli <[email protected]>

E2E localnet: specify that the localnet uses IPs from host subnet

d7627bf

Signed-off-by: Riccardo Ravaioli <[email protected]>

E2E localnet: make IP request for localnet pod extensible

9583a36

This is needed because we will need to generate IPs from different subnets than just the host subnet. Signed-off-by: Riccardo Ravaioli <[email protected]>

E2E localnet: Fix requirement on number of schedulable nodes

49b13d5

Signed-off-by: Riccardo Ravaioli <[email protected]>

E2E localnet: host network -> localnet on VLAN with external router

d1e414b

Signed-off-by: Riccardo Ravaioli <[email protected]>

E2E localnet: localnet -> host network on VLAN with external router

f8f0192

Signed-off-by: Riccardo Ravaioli <[email protected]>

E2E localnet: send three pings instead of just one

c94408c

In testing we saw how an invalid conntrack state would drop all echo requests after the first one. Let's send three pings in each test then. Signed-off-by: Riccardo Ravaioli <[email protected]>

fix: swap quay.io golang image provider

4a71db0

Use the golang image from quay.io/projectquay/golang due to issues pulling the image at quay.io/lib/golang Signed-off-by: Killian Muldoon <[email protected]>

Update maintainers affiliation

578d9e5

Signed-off-by: Nadia Pinaeva <[email protected]>

[kind] Bump kubevirt/ipam-extensions version to v0.3.1

1e2f091

Signed-off-by: Ram Lavi <[email protected]>

Add UDN subnet overlap check with transit switch subnet

742c830

This PR is to add subnet overlap check with transit switch subnet for any UDN. Signed-off-by: Arnab Ghosh <[email protected]>

Fetch agnhost image from k8s upstream

a5f5f90

Lets fetch agnhost image from upstream and pin it to the k8s release we consume. Downstream, we approve k8 images. It eases adoption of E2Es. Signed-off-by: Martin Kennelly <[email protected]>

martinkennelly and others added 12 commits September 19, 2025 11:32

E2Es: pass kconfig to detect deployment conf

f050a22

Required for downstream hack to detect deployment config. Signed-off-by: Martin Kennelly <[email protected]>

E2E: fixup e2es without a feature label

5fed2d7

Some new tests were added without a feature label. Signed-off-by: Martin Kennelly <[email protected]>

E2Es: dont panic when unable to find a deployment config

2511d87

It is wrong to panic here. We should return error if we cannot find a deployment config for the target cluster. Signed-off-by: Martin Kennelly <[email protected]>

enhancement: Improve layer2 topology for VMs

ef374d0

Signed-off-by: Enrique Llorente <[email protected]>

layer2 router topology: describe upgrade

d01caeb

Add choosing tunnel key options. Signed-off-by: Nadia Pinaeva <[email protected]>

Rename ovn_cluster_router to transit_router

edbe977

Fix other review comments that were not caught before. Change transit subnet from /30 to /31 Signed-off-by: Nadia Pinaeva <[email protected]>

expand EgressIP limitations and solution

40b0078

Signed-off-by: Patryk Diak <[email protected]>

pperiyasamy force-pushed the windows-ci-debug branch from cfa215b to 40b0078 Compare September 19, 2025 09:32

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[DNM] Troubleshoot e2e-aws-ovn-windows CI failure #2761

[DNM] Troubleshoot e2e-aws-ovn-windows CI failure #2761

Uh oh!

pperiyasamy commented Sep 18, 2025

Uh oh!

openshift-ci bot commented Sep 18, 2025

Uh oh!

openshift-merge-robot commented Sep 24, 2025

Uh oh!

openshift-ci bot commented Oct 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants

[DNM] Troubleshoot e2e-aws-ovn-windows CI failure #2761

Are you sure you want to change the base?

[DNM] Troubleshoot e2e-aws-ovn-windows CI failure #2761

Uh oh!

Conversation

pperiyasamy commented Sep 18, 2025

Uh oh!

openshift-ci bot commented Sep 18, 2025

Uh oh!

openshift-merge-robot commented Sep 24, 2025

Uh oh!

openshift-ci bot commented Oct 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants