Under which category would you file this issue?
Helm chart
Apache Airflow version
main (3.0.0.dev0), affects CI K8S system tests
What happened and how to reproduce it?
Problem
K8S system tests fail intermittently when Docker Hub anonymous-pull rate limits are exhausted. The Helm chart's postgresql subchart uses bitnamilegacy/postgresql:16.1.0-debian-11-r15, which containerd pulls inside Kind at pod scheduling time, unauthenticated and without retry. When the runner IP's 100-pull/6h quota is spent, PostgreSQL never starts and all Airflow pods enter CrashLoopBackOff waiting for DB migrations.
PR #66423 added K8S_TEST_IMAGES_TO_PRELOAD to address this class of flake for the alpine, busybox, and ubuntu images, but the postgresql image, the most critical one since all Airflow components depend on it, was not included.
How to reproduce
Non-deterministic. It depends on how many CI jobs share the runner IP within Docker Hub's 6-hour window. Evidence from two unrelated PRs:
PR #66420 (test: verify K8S CI runner Docker Hub connectivity), a one-line comment change to k8s-tests.yml (cannot cause functional failure):
5/6 K8S system test jobs passed, 1 failed (KubernetesExecutor-3.10-v1.30.13-true)
Same executor+python+K8S version as a passing job (KubernetesExecutor-3.10-v1.30.13-false passed)
Error:
ErrImagePull: failed to pull and unpack image "docker.io/bitnamilegacy/postgresql:16.1.0-debian-11-r15":
429 Too Many Requests - Server message: toomanyrequests: You have reached your unauthenticated pull rate limit.
PR #65840 (Add sphinx-airflow-theme as uv workspace member), sphinx theme workspace (no K8S code changes):
35/36 K8S system test jobs passed, 1 failed (CeleryExecutor-3.11-v1.31.12-true)
Failed at the very first "Cleanup repo" step (docker run bash) before any test code ran:
docker: Error response from daemon: Head "https://registry-1.docker.io/v2/library/bash/manifests/latest":
net/http: TLS handshake timeout
Main branch run 25461521992 (same day): all 6 K8S jobs passed, confirming the failure is non-deterministic, not a regression.
What you think should happen instead?
The bitnamilegacy/postgresql:16.1.0-debian-11-r15 image should be included in K8S_TEST_IMAGES_TO_PRELOAD (added by PR #66423). The mechanism already exists:
Host-side docker pull with retry-on-429
kind load docker-image into cluster nodes
Kubelet finds the image locally (imagePullPolicy: IfNotPresent because the tag is pinned)
This is the same proven pattern that already protects alpine:3.23, busybox:1.37, and ubuntu:24.04.
Fix PR: #66507 (all 6 K8S system tests pass with this change)
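The preload pattern described above (host-side docker pull with retry on 429, then kind load docker-image) can be sketched roughly as follows. This is a minimal illustration and not the implementation from PR #66423: the function names, the attempt count, the backoff delays, and the cluster name are all assumptions made for the example.

```python
import subprocess
import time
from typing import Callable, Sequence

# Image named in this issue. Everything else below (names, retry counts,
# delays, cluster name) is a hypothetical choice made for illustration.
POSTGRES_IMAGE = "bitnamilegacy/postgresql:16.1.0-debian-11-r15"

Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def run_command(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a command and capture its output so stderr can be inspected."""
    return subprocess.run(list(cmd), capture_output=True, text=True, check=False)


def is_rate_limited(stderr: str) -> bool:
    """Detect the Docker Hub rate-limit response quoted in the error above."""
    lowered = stderr.lower()
    return "toomanyrequests" in lowered or "429 too many requests" in lowered


def pull_with_retry(
    image: str,
    runner: Runner = run_command,
    attempts: int = 5,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Pull on the host, backing off exponentially only on rate-limit errors."""
    for attempt in range(attempts):
        result = runner(["docker", "pull", image])
        if result.returncode == 0:
            return True
        if not is_rate_limited(result.stderr):
            return False  # some other failure: do not retry in this sketch
        if attempt < attempts - 1:
            sleep(base_delay * 2**attempt)
    return False


def preload_into_kind(
    image: str,
    cluster_name: str,
    runner: Runner = run_command,
    attempts: int = 5,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Pull the image on the host, then copy it into the Kind cluster nodes."""
    if not pull_with_retry(image, runner, attempts, base_delay, sleep):
        return False
    result = runner(["kind", "load", "docker-image", image, "--name", cluster_name])
    return result.returncode == 0


if __name__ == "__main__":
    # "kind" is Kind's default cluster name; the CI cluster name may differ.
    raise SystemExit(0 if preload_into_kind(POSTGRES_IMAGE, "kind") else 1)
```

Once the image is loaded into the nodes, the kubelet does not need to contact Docker Hub for a pinned tag, which is why the in-cluster pull (the step without retry) is avoided. This sketch retries only on the 429 response; whether timeouts such as the TLS handshake failure should also be retried is a separate design choice.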
Operating System
Ubuntu (GitHub Actions runner)
Deployment
Official Apache Airflow Helm Chart
Apache Airflow Provider(s)
No response
Versions of Apache Airflow Providers
No response
Official Helm Chart version
main (development)
Kubernetes Version
v1.30.13, v1.31.12 (both observed failing)
Helm Chart configuration
Default chart/values.yaml
Docker Image customizations
Not Applicable
Anything else?
Frequency: Intermittent — observed ~1 out of 6 K8S jobs failing per run when rate-limited.
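Because the failure depends on how much of the anonymous quota is left for the runner IP, it can help to log the remaining pulls at the start of a job. The sketch below assumes the ratelimitpreview/test repository and the ratelimit-limit / ratelimit-remaining response headers that Docker's documentation describes for checking pull limits. The function and class names are hypothetical, and the headers may be absent for accounts that are not rate-limited.

```python
import json
import urllib.request
from typing import NamedTuple, Optional, Tuple

# Endpoints as described in Docker's rate-limit documentation (an assumption
# of this sketch; they are not referenced anywhere in this issue).
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


class RateLimit(NamedTuple):
    pulls: int
    window_seconds: int


def parse_ratelimit_header(value: str) -> RateLimit:
    """Parse a header value such as '100;w=21600' (pulls; window in seconds)."""
    count, _, window = value.partition(";w=")
    return RateLimit(int(count), int(window) if window else 0)


def fetch_anonymous_rate_limit(
    timeout: float = 30.0,
) -> Tuple[Optional[RateLimit], Optional[RateLimit]]:
    """Return (limit, remaining) for this IP, or None where a header is missing."""
    with urllib.request.urlopen(TOKEN_URL, timeout=timeout) as response:
        token = json.load(response)["token"]
    request = urllib.request.Request(
        MANIFEST_URL,
        method="HEAD",  # a HEAD request, so only the headers are fetched
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        limit = response.headers["ratelimit-limit"]
        remaining = response.headers["ratelimit-remaining"]
    return (
        parse_ratelimit_header(limit) if limit else None,
        parse_ratelimit_header(remaining) if remaining else None,
    )


if __name__ == "__main__":
    print(fetch_anonymous_rate_limit())
```

A window of 21600 seconds corresponds to the 6-hour window mentioned in this report.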
Separate issue: bash:latest in the "Cleanup repo" step
The K8S workflow (and 10+ other workflows) uses docker run ... bash -c "rm -rf /workspace/*" as its first step. This pulls library/bash:latest from Docker Hub unauthenticated. The TLS timeout in PR #65840 hit this step. This is a broader problem (not K8S-specific) and should be tracked separately; a possible fix is replacing it with sudo rm -rf in a shell step.
Related:
K8S_TEST_IMAGES_TO_PRELOAD mechanism (merged)
bitnamilegacy/postgresql (user-facing, same image)
Are you willing to submit PR?
Code of Conduct