Description:
We have observed a significant delay in the propagation of backend Pod IP addresses from Envoy Gateway to Envoy Proxy during rolling restarts of said backend service. In some cases it has taken 30-120s for the IP addresses of new Pods to be propagated to the Envoy Proxy Pods, meaning there is a delay before Envoy itself starts routing traffic to the new Pods.
The timeline looks as follows:
- A rollout restart of a backend Deployment is triggered
- Several new Pods come up (depending on the rollout strategy of the backend Deployment) and become Ready
- (Immediately) The EndpointSlice for the Deployment is updated with the appropriate conditions for the new Pods (e.g. ready/serving/terminating) and their IP addresses
  - This can be seen in the Kubernetes API
- (Immediately) The Envoy Gateway controller picks up the change via the Watch API and adjusts its own internal representation of the EndpointSlice to match
  - This can be seen by inspecting `resources.endpointSlices` as returned by `$ENVOY_GATEWAY_POD_IP:19000/api/config_dump?resource=all`
- (After 30-120s) The endpoint config in the Envoy Proxy Pods is updated with the new Pod IP addresses
  - This can be observed with `egctl config envoy-proxy endpoint` (example commands for all three checks are sketched below the list)
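For reference, here is a minimal sketch of how we compare the three views above. The Service label selector and the default `envoy-gateway-system`/`envoy-gateway` install names are assumptions; adjust them to your environment, and note that `egctl` flags may vary by version.

```bash
# 1. Endpoint conditions and IPs as seen by the Kubernetes API
#    ("my-backend" is a placeholder Service name)
kubectl get endpointslice -l kubernetes.io/service-name=my-backend -o yaml

# 2. Envoy Gateway's internal view, via its admin port
#    (assumes the default envoy-gateway-system/envoy-gateway install names)
kubectl port-forward -n envoy-gateway-system deploy/envoy-gateway 19000:19000 &
curl -s 'http://localhost:19000/api/config_dump?resource=all' | jq '.resources.endpointSlices'

# 3. Endpoints actually programmed into the Envoy Proxy Pods
egctl config envoy-proxy endpoint -n envoy-gateway-system
```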
Given that Envoy Gateway's own config dump reflects the endpoint changes near-immediately, the delay lies somewhere in the xDS/translation layer between Envoy Gateway and the programmed Envoy Proxy Pods.
If a rollout restart happens quickly enough that all old Pods are replaced with new Pods before the updated IP addresses are propagated to the Envoy Proxy Pods, Envoy is left unable to route any traffic: its only representation of backend IP addresses is for Pods that no longer exist, even though both the Kubernetes API and Envoy Gateway reflect the new endpoint IPs almost immediately. The only way to work around this race condition seems to be to significantly slow down rollouts with Deployment strategies and preStop hooks; however, 30s seems far too long for such a race condition.
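For context, the kind of slowdown we mean looks roughly like the patches below. This is a hedged sketch rather than a recommended fix: `my-backend`, the container name `app`, and the sleep duration are placeholders, and the image needs a `sleep` binary for the preStop hook to work.

```bash
# Surge new Pods in without removing old ones early, and hold each new Pod
# back for a while before it counts as available
kubectl patch deployment my-backend --type=strategic -p '{
  "spec": {
    "minReadySeconds": 30,
    "strategy": {"rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0}}
  }
}'

# Keep terminating Pods alive long enough for the delayed endpoint update
# to reach the Envoy Proxy Pods before their IPs disappear
kubectl patch deployment my-backend --type=strategic -p '{
  "spec": {"template": {"spec": {"containers": [{
    "name": "app",
    "lifecycle": {"preStop": {"exec": {"command": ["sleep", "60"]}}}
  }]}}}
}'
```

Here `maxUnavailable: 0` keeps old Pods serving until new ones are Ready, and the preStop sleep keeps their IPs valid while the proxy catches up.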
Repro steps:
A debug script that generates a visualisation of the rollout of IP addresses is here: https://gist.github.com/coro/ab3638bd83c16adccb01d98625d5a395
While running this script, perform a rollout restart of a Deployment targeted by an Envoy Gateway HTTPRoute.
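If running the full gist is inconvenient, a much simpler poll in the same spirit already makes the lag visible. The Deployment/Service names, the `envoy-gateway-system` namespace, and the grep over the proxy dump are assumptions; the gist does this properly.

```bash
# Trigger the rollout
kubectl rollout restart deployment/my-backend

# Poll both views once per second: the EndpointSlice IPs change immediately,
# while the IPs programmed into the proxy lag behind
while true; do
  date +%T
  echo -n 'EndpointSlice: '
  kubectl get endpointslice -l kubernetes.io/service-name=my-backend \
    -o jsonpath='{.items[*].endpoints[*].addresses[0]}{"\n"}'
  echo -n 'Envoy Proxy:   '
  egctl config envoy-proxy endpoint -n envoy-gateway-system \
    | grep -oE '"address": ?"[0-9]+(\.[0-9]+){3}"' | sort -u | tr '\n' ' '
  echo; echo '---'
  sleep 1
done
```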
Note that we have tended to see this behaviour on specific Deployments with many (6+) replicas, and it is exacerbated by performing multiple restarts back to back (i.e. more Pod churn).
EDIT: As @duizabojul found in #7358, it looks like this can also occur in single-replica Deployments, so it is perhaps not impacted by replica count.
Environment:
Seen across multiple versions, most recently:
- EG v1.5.3
- Envoy Proxy `envoy:distroless-v1.35.3`
- EKS v1.32
- Network topology: 2 Gateways, each with 6 Proxy Pods
Logs:
Output of the debug script is attached, showing the rollout of the IP addresses programmed in Envoy. Header meanings:
- `ENVOY_HEALTH_STATUS` - health of the endpoint as measured in the Envoy Proxy Pod
- `READY`/`SERVING`/`TERMINATING` - endpoint condition as reported by the EndpointSlice
- `EG_READY`/`EG_SERVING`/`EG_TERMINATING` - endpoint condition as reported by the Envoy Gateway controller config dump
- `RED_DURATION` - how long an endpoint had been in a 'red' state (which we define as `Ready` and `Serving` according to the EndpointSlice but not `HEALTHY` in Envoy Proxy)
Output: out.txt
In this case, these logs show delays of up to 30s for specific endpoints.
As discussed with @arkodg on Slack the Envoy Gateway controller logs are too large to attach here.