K8S system tests flake: postgresql image hit Docker Hub rate limit inside Kind #66511

@srchilukoori

Description

Under which category would you file this issue?

Helm chart

Apache Airflow version

main (3.0.0.dev0) — affects CI K8S system tests

What happened and how to reproduce it?

Problem

K8S system tests fail intermittently when the Docker Hub anonymous-pull rate limit is exhausted. The Helm chart's postgresql subchart uses bitnamilegacy/postgresql:16.1.0-debian-11-r15, which containerd inside Kind pulls at pod scheduling time, unauthenticated and without retry. When the runner IP's 100-pull/6h quota is spent, PostgreSQL never starts and all Airflow pods enter CrashLoopBackOff waiting for DB migrations.
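To check how much of the anonymous quota a runner IP has left, Docker Hub documents a `ratelimitpreview/test` repository whose manifest response carries `ratelimit-limit` and `ratelimit-remaining` headers in the form `100;w=21600`. The following is a minimal diagnostic sketch under that assumption; the function names are hypothetical and nothing like it exists in the Airflow CI scripts as far as this issue shows.

```python
import json
import urllib.request

# Endpoints as described in Docker Hub's rate-limit documentation.
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(value: str) -> tuple[int, int]:
    """Parse a header value such as '100;w=21600' into (pulls, window_seconds)."""
    count, _, window = value.partition(";w=")
    return int(count), int(window) if window else 0


def remaining_anonymous_pulls() -> tuple[int, int]:
    """Return (remaining_pulls, window_seconds) for the caller's IP.

    Performs network requests. If the quota is already exhausted the
    registry may answer 429, which urllib raises as HTTPError.
    """
    with urllib.request.urlopen(TOKEN_URL, timeout=30) as resp:
        token = json.load(resp)["token"]
    request = urllib.request.Request(
        MANIFEST_URL,
        method="HEAD",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        value = resp.headers.get("ratelimit-remaining")
    if value is None:
        raise RuntimeError("registry did not return a ratelimit-remaining header")
    return parse_ratelimit(value)
```

Calling `remaining_anonymous_pulls()` at the start of a job and logging the result would show whether a later `ErrImagePull` coincides with a depleted quota.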

PR #66423 added K8S_TEST_IMAGES_TO_PRELOAD to address this class of flake for alpine, busybox, and ubuntu images, but the postgresql image — the most critical one since all Airflow components depend on it — was not included.

How to reproduce

Non-deterministic: the failure depends on how many CI jobs share the runner IP within Docker Hub's 6-hour window. Evidence from two unrelated PRs and one main-branch run:

  1. PR #66420 ("test: verify K8S CI runner Docker Hub connectivity"), a one-line comment change to k8s-tests.yml (cannot cause a functional failure):

    • 5/6 K8S system test jobs passed, 1 failed (KubernetesExecutor-3.10-v1.30.13-true)
    • Same executor+python+K8S version as a passing job (KubernetesExecutor-3.10-v1.30.13-false passed)
    • Error:
    ErrImagePull: failed to pull and unpack image "docker.io/bitnamilegacy/postgresql:16.1.0-debian-11-r15":
    429 Too Many Requests - Server message: toomanyrequests: You have reached your unauthenticated pull rate limit.
    
  2. PR #65840 ("Add sphinx-airflow-theme as uv workspace member"), a sphinx theme workspace change (no K8S code changes):

    • 35/36 K8S system test jobs passed, 1 failed (CeleryExecutor-3.11-v1.31.12-true)
    • Failed at the very first "Cleanup repo" step (docker run bash) before any test code ran:
    docker: Error response from daemon: Head "https://registry-1.docker.io/v2/library/bash/manifests/latest":
    net/http: TLS handshake timeout
    
  3. Main branch run 25461521992 (same day): all 6 K8S jobs passed — confirming the failure is non-deterministic, not a regression.
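The two failures above have different signatures: a 429 rate-limit response in the first and a TLS handshake timeout in the second. A small triage helper could separate them when scanning job logs. This is only a sketch; the function name and category labels are hypothetical, and the matched strings are taken from the two error messages quoted above.

```python
import re


def classify_pull_failure(log_text: str) -> str:
    """Roughly bucket a Docker/containerd pull error by its message text."""
    lowered = log_text.lower()
    # Docker Hub rate limiting: HTTP 429 with a "toomanyrequests" server message.
    if "toomanyrequests" in lowered or re.search(r"\b429\b", lowered):
        return "rate_limited"
    # Connectivity problems reaching the registry.
    if "tls handshake timeout" in lowered or "i/o timeout" in lowered:
        return "network_timeout"
    return "unknown"
```

Both categories point at unauthenticated Docker Hub access from the runner, but only the first is addressed by preloading images into Kind.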

What you think should happen instead?

The bitnamilegacy/postgresql:16.1.0-debian-11-r15 image should be included in K8S_TEST_IMAGES_TO_PRELOAD (added by PR #66423). The mechanism already exists:

  1. Host-side docker pull with retry-on-429
  2. kind load docker-image into cluster nodes
  3. Kubelet finds the image locally (imagePullPolicy: IfNotPresent because the tag is pinned)

This is the same proven pattern that already protects alpine:3.23, busybox:1.37, and ubuntu:24.04.
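The pull-then-load steps listed above can be sketched roughly as follows. This is not the code from PR #66423; the function names, retry count, and delays are assumptions for illustration, and the sketch retries on any failed pull rather than only on 429. The `docker pull` and `kind load docker-image ... --name` commands are the standard CLI invocations.

```python
import subprocess
import time


def pull_with_retry(image, attempts=5, base_delay=10.0,
                    run=subprocess.run, sleep=time.sleep) -> bool:
    """Pull an image on the host, backing off exponentially between failures."""
    for attempt in range(1, attempts + 1):
        result = run(["docker", "pull", image], capture_output=True, text=True)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            # Delays of 10s, 20s, 40s, ... with the default base_delay.
            sleep(base_delay * 2 ** (attempt - 1))
    return False


def preload(images, cluster_name, run=subprocess.run, sleep=time.sleep):
    """Pull each image on the host, then copy it into the Kind cluster nodes."""
    loaded = []
    for image in images:
        if not pull_with_retry(image, run=run, sleep=sleep):
            raise RuntimeError(f"could not pull {image}")
        run(["kind", "load", "docker-image", image, "--name", cluster_name],
            check=True)
        loaded.append(image)
    return loaded
```

With the postgresql image added to the list passed to `preload`, the kubelet would find it on the node and skip the registry, since a pinned (non-latest) tag defaults to `imagePullPolicy: IfNotPresent`. The `run` and `sleep` parameters exist only so the sketch can be exercised without Docker installed.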

Fix PR: #66507 (all 6 K8S system tests pass with this change)

Operating System

Ubuntu (GitHub Actions runner)

Deployment

Official Apache Airflow Helm Chart

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

No response

Official Helm Chart version

main (development)

Kubernetes Version

v1.30.13, v1.31.12 (both observed failing)

Helm Chart configuration

Default chart/values.yaml:

postgresql:
  enabled: true
  image:
    repository: bitnamilegacy/postgresql
    tag: "16.1.0-debian-11-r15"

Docker Image customizations

Not Applicable

Anything else?

Frequency: Intermittent — observed ~1 out of 6 K8S jobs failing per run when rate-limited.

Separate issue — bash:latest in "Cleanup repo" step:

The K8S workflow (and 10+ other workflows) uses docker run ... bash -c "rm -rf /workspace/*" as its first step. This pulls library/bash:latest from Docker Hub unauthenticated, and the TLS timeout in PR #65840 hit this step. This is a broader problem (not K8S-specific) and should be tracked separately. A possible fix is to replace the container invocation with sudo rm -rf in a shell step.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
