Delayed endpoint updates for specific HTTPRoute causing 503 errors #7358

@duizabojul

Description

We're experiencing a critical issue with Envoy Gateway where endpoint updates are delayed by 5-10 minutes for a specific HTTPRoute (Keycloak), resulting in 503 errors during deployments. This issue is reproducible only with this particular HTTPRoute, while other HTTPRoutes in our cluster work correctly.

Environment

  • Kubernetes Version: GKE v1.32 with dataplane v2
  • Envoy Gateway Version: v1.5.4
  • Cluster Setup:
    • 2 Envoy Gateway instances
    • 2 proxy replicas per gateway
    • 1 GKE native gateway (same namespace)
  • Routing Mode: Endpoints (issue doesn't occur with Service mode, but that's not a viable solution for us)

Observed Behavior

During deployment rollouts of the Keycloak application, Envoy proxies receive endpoint updates with significant delays (5-10 minutes), causing:

  • Incorrect proxy state
  • 503 errors for incoming requests
  • Traffic being routed to terminated pods

We monitored this using:

viddy 'egctl config envoy-proxy endpoint -n gateways -l gateway.envoyproxy.io/owning-gateway-name=my-public-envoy | jq -S "
  .gateways |= with_entries(
    .value.dynamicEndpointConfigs |= sort_by(.endpointConfig.clusterName) |
    .value.staticEndpointConfigs |= sort_by(.endpointConfig.clusterName)
  )
"'
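
To make the drift easier to spot than by eyeballing the watch output, the same JSON can be reduced to one set of `ip:port` endpoints per proxy pod and compared against the addresses Kubernetes currently reports. The following is a minimal sketch, not part of our actual tooling. It assumes the top-level key of the egctl output is the namespace (`gateways`, as in the jq filter above) and that each endpoint config follows Envoy's ClusterLoadAssignment JSON layout (`endpoints[].lbEndpoints[].endpoint.address.socketAddress`). The function names and the `keycloak` cluster-name substring are hypothetical.

```python
from typing import Dict, Set


def endpoints_by_proxy(dump: dict, namespace: str, cluster_substring: str) -> Dict[str, Set[str]]:
    """Collect 'ip:port' endpoints per proxy pod for clusters whose name contains cluster_substring.

    Assumes dump[namespace] maps proxy pod names to an endpoints config dump.
    """
    per_proxy: Dict[str, Set[str]] = {}
    for pod, cfg_dump in (dump.get(namespace) or {}).items():
        addrs: Set[str] = set()
        for cfg in cfg_dump.get("dynamicEndpointConfigs") or []:
            cla = cfg.get("endpointConfig") or {}
            if cluster_substring not in cla.get("clusterName", ""):
                continue
            for locality in cla.get("endpoints") or []:
                for lb in locality.get("lbEndpoints") or []:
                    address = (lb.get("endpoint") or {}).get("address") or {}
                    sock = address.get("socketAddress") or {}
                    if "address" in sock:
                        addrs.add(f"{sock['address']}:{sock.get('portValue')}")
        per_proxy[pod] = addrs
    return per_proxy


def stale_endpoints(per_proxy: Dict[str, Set[str]], live: Set[str]) -> Dict[str, Set[str]]:
    """Return, per proxy pod, the endpoints Envoy still holds that are not in the live set."""
    return {pod: addrs - live for pod, addrs in per_proxy.items() if addrs - live}
```

In use, `dump` would be the parsed output of the `egctl config envoy-proxy endpoint` command above, and `live` the `ip:port` pairs taken from `kubectl get endpointslices -n keycloak -l kubernetes.io/service-name=keycloak -o json`. A non-empty result from `stale_endpoints` while the rollout has already finished corresponds to the delayed state described here.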

Expected Behavior

Endpoint updates should propagate to Envoy proxies within seconds (similar to other HTTPRoutes in the cluster), ensuring zero-downtime deployments.

Manifests

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: keycloak
  namespace: keycloak
spec:
  hostnames:
    - keycloak.my.domain
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-public-envoy
      namespace: gateways
      sectionName: https
  rules:
    - backendRefs:
        - group: ''
          kind: Service
          name: keycloak
          port: 8080
          weight: 1
      matches:
        - path:
            type: PathPrefix
            value: /realms/
        - path:
            type: PathPrefix
            value: /resources/
        - path:
            type: Exact
            value: /robots.txt
status:
  parents:
    - conditions:
        - lastTransitionTime: '2025-10-27T16:23:15Z'
          message: Route is accepted
          observedGeneration: 3
          reason: Accepted
          status: 'True'
          type: Accepted
        - lastTransitionTime: '2025-10-27T16:23:15Z'
          message: Resolved all the Object references for the Route
          observedGeneration: 3
          reason: ResolvedRefs
          status: 'True'
          type: ResolvedRefs
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
      parentRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: my-public-envoy
        namespace: gateways
        sectionName: https
---
apiVersion: v1
kind: Service
metadata:
  name: keycloak
  namespace: keycloak
spec:
  clusterIP: 10.3.12.238
  clusterIPs:
    - 10.3.12.238
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: http
  selector:
    app.kubernetes.io/instance: keycloak
    app.kubernetes.io/name: keycloak
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: keycloak
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: keycloak
    app.kubernetes.io/version: 26.0.5
    helm.sh/chart: keycloak-0.1.0
  name: keycloak
  namespace: keycloak
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: keycloak
      app.kubernetes.io/name: keycloak
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: keycloak
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: keycloak
        app.kubernetes.io/version: 26.0.5
        helm.sh/chart: keycloak-0.1.0
    spec:
      containers:
        - args:
            - start
            - '--verbose'
            - '--log-console-output=json'
            - '--log-level=INFO,org.keycloak:DEBUG'
            - '--features=user-event-metrics,client-secret-rotation'
          env:
            - name: STAKATER_KEYCLOAK_CONFIGMAP
              value: 4e6c49a1c79cdff645dce8afc232d95bc93155b4
            - name: KC_DB_USERNAME
              valueFrom:
                secretKeyRef:
                  key: LOGIN
                  name: keycloak-db
            - name: KC_DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: PASSWORD
                  name: keycloak-db
            - name: KC_DB_URL_HOST
              valueFrom:
                secretKeyRef:
                  key: HOST
                  name: keycloak-db
            - name: KC_DB_URL_PORT
              valueFrom:
                secretKeyRef:
                  key: PORT
                  name: keycloak-db
          envFrom:
            - secretRef:
                name: keycloak-env
            - configMapRef:
                name: keycloak
          image: europe-docker.pkg.dev/my-company/docker/keycloak:main-2ac8399
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health/live
              port: management
              scheme: HTTPS
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: keycloak
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
            - containerPort: 9000
              name: management
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health/ready
              port: management
              scheme: HTTPS
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              memory: 3Gi
            requests:
              cpu: 100m
              memory: 1Gi
          securityContext: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /certs/public
              name: public-tls
              readOnly: true
            - mountPath: /certs/private
              name: private-tls
              readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: keycloak
      serviceAccountName: keycloak
      terminationGracePeriodSeconds: 30
      volumes:
        - name: public-tls
          secret:
            defaultMode: 420
            secretName: keycloak-public-tls
        - name: private-tls
          secret:
            defaultMode: 420
            secretName: keycloak-private-tls
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime: '2025-10-22T15:47:45Z'
      lastUpdateTime: '2025-10-22T15:47:45Z'
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: 'True'
      type: Available
    - lastTransitionTime: '2025-09-24T19:04:05Z'
      lastUpdateTime: '2025-10-28T10:07:22Z'
      message: ReplicaSet "keycloak-7f64f656bf" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
  observedGeneration: 26
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
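
For context on what a rollout of this Deployment looks like from the proxy's side: with `replicas: 1`, the percentage-based `maxSurge` and `maxUnavailable` resolve to absolute pod counts, with Kubernetes rounding `maxSurge` up and `maxUnavailable` down. The sketch below only reproduces that arithmetic; the function name is hypothetical.

```python
import math
from typing import Tuple


def rollout_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> Tuple[int, int]:
    """Resolve percentage-based rolling update settings to absolute pod counts."""
    surge = math.ceil(replicas * max_surge_pct / 100)               # maxSurge rounds up
    unavailable = math.floor(replicas * max_unavailable_pct / 100)  # maxUnavailable rounds down
    return surge, unavailable


print(rollout_bounds(1, 25, 25))  # (1, 0)
```

So each rollout starts one new pod, waits for it to become ready, and only then terminates the old one. The Service's single endpoint address is therefore replaced wholesale, and a proxy that has not received the update has no valid endpoint left for this route.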

Any guidance on debugging or resolving this issue would be greatly appreciated!
