Description:
We have observed a significant delay in the propagation of backend Pod IP addresses from Envoy Gateway to Envoy Proxy during rolling restarts of said backend service. In some cases it has taken 30-120s for the IP addresses of new Pods to be propagated to the Envoy Proxy Pods, meaning there is a delay before Envoy itself starts routing traffic to the new Pods.
The timeline looks as follows:
- A rollout restart of a backend Deployment is triggered
- Several new Pods come up (depending on the rollout strategy of the backend Deployment) and become Ready
- (Immediately) The EndpointSlice for the Deployment is updated with the appropriate conditions for the new Pods (e.g. ready/serving/terminating) and their IP addresses
  - This can be seen in the Kubernetes API
- (Immediately) The Envoy Gateway controller picks up the change via the Watch API and adjusts its own internal representation of the EndpointSlice to match
  - This can be seen by inspecting `resources.endpointSlices` as returned by `$ENVOY_GATEWAY_POD_IP:19000/api/config_dump?resource=all`
- (After 30-120s) The endpoint config in the Envoy Proxy Pods is updated with the new Pod IP addresses
  - This can be observed with `egctl config envoy-proxy endpoint` (example commands for all three checks are sketched below the list)
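For reference, here is a minimal sketch of how we compare the three views above. The Service label selector and the default `envoy-gateway-system`/`envoy-gateway` install names are assumptions; adjust them to your environment, and note that `egctl` flags may vary by version.

```bash
# 1. Endpoint conditions and IPs as seen by the Kubernetes API
#    ("my-backend" is a placeholder Service name)
kubectl get endpointslice -l kubernetes.io/service-name=my-backend -o yaml

# 2. Envoy Gateway's internal view, via its admin port
#    (assumes the default envoy-gateway-system/envoy-gateway install names)
kubectl port-forward -n envoy-gateway-system deploy/envoy-gateway 19000:19000 &
curl -s 'http://localhost:19000/api/config_dump?resource=all' | jq '.resources.endpointSlices'

# 3. Endpoints actually programmed into the Envoy Proxy Pods
egctl config envoy-proxy endpoint -n envoy-gateway-system
```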
Given that Envoy Gateway's own config dump reflects the endpoint changes near-immediately, the delay lies somewhere in the xDS/translation layer between Envoy Gateway and the programmed Envoy Proxy Pods.
If a rollout restart happens quickly enough that all old Pods are replaced with new Pods before the updated IP addresses are propagated to the Envoy Proxy Pods, Envoy is left unable to route any traffic: its only representation of backend IP addresses is for Pods that no longer exist, even though both the Kubernetes API and Envoy Gateway reflect the new endpoint IPs almost immediately. The only way to work around this race condition seems to be to significantly slow down rollouts with Deployment strategies and preStop hooks; however, 30s seems far too long for such a race condition.
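For context, the kind of slowdown we mean looks roughly like the patches below. This is a hedged sketch rather than a recommended fix: `my-backend`, the container name `app`, and the sleep duration are placeholders, and the image needs a `sleep` binary for the preStop hook to work.

```bash
# Surge new Pods in without removing old ones early, and hold each new Pod
# back for a while before it counts as available
kubectl patch deployment my-backend --type=strategic -p '{
  "spec": {
    "minReadySeconds": 30,
    "strategy": {"rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0}}
  }
}'

# Keep terminating Pods alive long enough for the delayed endpoint update
# to reach the Envoy Proxy Pods before their IPs disappear
kubectl patch deployment my-backend --type=strategic -p '{
  "spec": {"template": {"spec": {"containers": [{
    "name": "app",
    "lifecycle": {"preStop": {"exec": {"command": ["sleep", "60"]}}}
  }]}}}
}'
```

Here `maxUnavailable: 0` keeps old Pods serving until new ones are Ready, and the preStop sleep keeps their IPs valid while the proxy catches up.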
Repro steps:
A debug script that generates a visualisation of the rollout of IP addresses is here: https://gist.github.com/coro/ab3638bd83c16adccb01d98625d5a395
While running this script, perform a rollout restart of a Deployment targeted by an Envoy Gateway HTTPRoute.
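If running the full gist is inconvenient, a much simpler poll in the same spirit already makes the lag visible. The Deployment/Service names, the `envoy-gateway-system` namespace, and the grep over the proxy dump are assumptions; the gist does this properly.

```bash
# Trigger the rollout
kubectl rollout restart deployment/my-backend

# Poll both views once per second: the EndpointSlice IPs change immediately,
# while the IPs programmed into the proxy lag behind
while true; do
  date +%T
  echo -n 'EndpointSlice: '
  kubectl get endpointslice -l kubernetes.io/service-name=my-backend \
    -o jsonpath='{.items[*].endpoints[*].addresses[0]}{"\n"}'
  echo -n 'Envoy Proxy:   '
  egctl config envoy-proxy endpoint -n envoy-gateway-system \
    | grep -oE '"address": ?"[0-9]+(\.[0-9]+){3}"' | sort -u | tr '\n' ' '
  echo; echo '---'
  sleep 1
done
```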
Note that we have tended to see this behaviour on specific Deployments with many (6+) replicas, and it is exacerbated by performing multiple restarts back to back (i.e. more Pod churn).
EDIT: As @duizabojul found in #7358, it looks like this can also occur in single-replica Deployments, so it is perhaps not impacted by replica count.
Environment:
Seen across multiple versions, most recently:
- EG v1.5.3
- Envoy Proxy `envoy:distroless-v1.35.3`
- EKS v1.32
- Network topology: 2 Gateways, each with 6 Proxy Pods
Logs:
Output of the debug script is attached, showing the rollout of the IP addresses programmed in Envoy. Header meanings:
- `ENVOY_HEALTH_STATUS` - health of the endpoint as measured in the Envoy Proxy Pod
- `READY`/`SERVING`/`TERMINATING` - endpoint condition as reported by the EndpointSlice
- `EG_READY`/`EG_SERVING`/`EG_TERMINATING` - endpoint condition as reported by the Envoy Gateway controller config dump
- `RED_DURATION` - how long an endpoint had been in a 'red' state (which we define as `Ready` and `Serving` according to the EndpointSlice but not `HEALTHY` in Envoy Proxy)
Output: out.txt
In this case, these logs show delays of up to 30s for specific endpoints.
As discussed with @arkodg on Slack the Envoy Gateway controller logs are too large to attach here.