Requeue KubernetesExecutor tasks whose pod failed before execution started by seanmuth · Pull Request #69058 · apache/airflow

seanmuth · 2026-06-26T20:20:54Z

When a worker pod is destroyed before the task process starts — a node drain, autoscaler scale-down, node boot race, or transient image pull failure — the task instance is still in queued state and no task code has run. Today the KubernetesExecutor reports this to the scheduler as a normal FAILED, which consumes a user-configured task retry and raises a misleading failure alert for work that never executed.

This adds a transparent, executor-level requeue for that case. In _change_state, a pod that reports FAILED while its task instance is still QUEUED is requeued onto the existing task_queue (the same mechanism task_publish_max_retries already uses for pod creation failures) without writing to the event buffer, so the scheduler never observes the failure and no task-level retry is consumed.

Behavior is bounded and configurable:

- pod_launch_failure_retries (default 1, -1 unlimited, 0 disables): how many times a task is transparently requeued before failing normally.
- pod_launch_failure_excluded_container_reasons (default Error): container reasons that opt out of the requeue path and consume a normal retry instead. The default excludes Error, which covers a container that started executing but whose worker process exited before writing running to the DB (most likely an Airflow-specific startup error rather than a transient infrastructure event).
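
For reviewers, the limit semantics above could be read roughly as the sketch below. The function name `may_requeue` and its arguments are hypothetical and are not taken from the diff. Only the meaning of -1, 0 and N comes from the description.

```python
def may_requeue(requeues_so_far: int, limit: int) -> bool:
    """Return True if one more transparent requeue is allowed.

    Assumed semantics, taken from the option description:
    -1 means unlimited, 0 disables the requeue path, N allows N requeues.
    """
    if limit == -1:
        return True
    # With limit == 0 this is always False, so the task fails normally.
    return requeues_so_far < limit
```

With the default of 1, the first pre-execution failure is requeued and the second is reported to the scheduler as a normal failure.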

The ti_state == QUEUED check is the authoritative signal: a task that was actually executing would already have transitioned to running, so OOM-kills and other mid-execution failures are unaffected. Deferrable-operator resume pods are covered for free — when the triggerer fires, the TI returns to queued, so a resume pod killed before execute_complete starts is requeued rather than discarding already-completed external work.
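
A minimal standalone sketch of that decision follows. It assumes states are plain strings and that the failure details reduce to a single container reason. The real `_is_pre_execution_failure` in the diff takes different arguments and may differ in detail.

```python
from __future__ import annotations


def is_pre_execution_failure(
    pod_state: str,
    ti_state: str | None,
    container_reason: str | None,
    excluded_reasons: frozenset[str] = frozenset({"Error"}),
) -> bool:
    """Hypothetical stand-in for the check described above."""
    if pod_state != "failed":
        return False
    # A task that started executing has already left the queued state,
    # so mid-execution failures such as OOM kills never match.
    if ti_state != "queued":
        return False
    # Excluded container reasons opt out and consume a normal retry.
    return container_reason not in excluded_reasons
```

A pod evicted during a node drain while the TI is still queued returns True. The same pod with container reason Error, or with a TI already in running, returns False.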

closes: #69052

Was generative AI tooling used to co-author this PR?

Yes — Claude Code (Opus 4.8)

Generated-by: Claude Code (Opus 4.8) following the guidelines

…arted When a worker pod is destroyed before the task process starts (node drain, autoscaler scale-down, node boot race, transient image pull failure), the task instance is still queued and no task code has run. Reporting this to the scheduler as a normal failure consumes a user-configured retry and raises a misleading failure alert for work that never executed. The executor already has the signal to tell this apart from an execution failure, so it now transparently requeues the pod without consuming a task retry, bounded by pod_launch_failure_retries and excluding container reasons in pod_launch_failure_excluded_container_reasons (default Error).

seanmuth · 2026-06-26T20:23:14Z

Working on a test confirmation in Astro Hosted now

The tests replaced executor.task_queue with a MagicMock, so the executor.end() teardown looped forever on get_nowait() instead of raising Empty, hitting the 60s CI timeout. Assert the requeue via observable executor state instead of mocking the queue, and pass a valid executor_config to the stash test so execute_async does not bail before recording the job.
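
The hang can be reproduced in isolation. The `drain` helper below is a hypothetical approximation of a teardown loop and is not the actual `executor.end()` code. It is capped so the demonstration terminates.

```python
import queue
from unittest.mock import MagicMock


def drain(q, max_iterations: int = 1000) -> tuple[int, bool]:
    """Drain q until queue.Empty is raised; return (items_seen, terminated)."""
    seen = 0
    for _ in range(max_iterations):
        try:
            q.get_nowait()
        except queue.Empty:
            return seen, True
        seen += 1
    # Reached only if Empty was never raised within the cap.
    return seen, False


real_queue = queue.Queue()
real_queue.put("task")
real_result = drain(real_queue)

# MagicMock().get_nowait() returns another MagicMock and never raises
# queue.Empty, so an uncapped loop of this shape would spin forever.
mock_result = drain(MagicMock())
```

Here `real_result` is `(1, True)` and `mock_result` is `(1000, False)`, which is why asserting on observable executor state is safer than mocking the queue.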

uranusjr · 2026-06-26T22:56:56Z

+        if state == TaskInstanceState.FAILED and self._is_pre_execution_failure(
+            state,
+            self._get_task_instance_state(key, session=session),
+            failure_details,
+            self.pod_launch_failure_excluded_container_reasons,
+        ):


Suggested change

-        if state == TaskInstanceState.FAILED and self._is_pre_execution_failure(
-            state,
-            self._get_task_instance_state(key, session=session),
-            failure_details,
-            self.pod_launch_failure_excluded_container_reasons,
-        ):
+        if (
+            state == TaskInstanceState.FAILED
+            and self.pod_launch_failure_max_retries != 0
+            and self._is_pre_execution_failure(
+                state,
+                self._get_task_instance_state(key, session=session),
+                failure_details,
+                self.pod_launch_failure_excluded_container_reasons,
+            )
+        ):

This saves some db queries when no retries are allowed. Not a big deal, I think? (It shouldn’t be too common to explicitly disallow retries.)
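
The saving comes from `and` short-circuiting. As a rough illustration, `CountingLookup` and `should_requeue` below are hypothetical stand-ins, with the lookup playing the role of the DB-backed `_get_task_instance_state` call.

```python
class CountingLookup:
    """Stand-in for a DB-backed task-instance state lookup that counts its calls."""

    def __init__(self, ti_state: str) -> None:
        self.ti_state = ti_state
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        return self.ti_state


def should_requeue(pod_state: str, max_retries: int, lookup: CountingLookup) -> bool:
    # `and` evaluates left to right and stops at the first falsy operand,
    # so lookup() only runs when the two cheap checks before it pass.
    return pod_state == "failed" and max_retries != 0 and lookup() == "queued"
```

With `max_retries` set to 0 the lookup is never called. With any other value it is called once per failed pod.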

uranusjr · 2026-06-26T22:59:35Z

        if state == ADOPTED:
            # When the task pod is adopted by another executor,
            # then remove the task from the current executor running queue.
+            self.last_known_jobs.pop(key, None)
            try:
                self.running.remove(key)
            except KeyError:
                self.log.debug("TI key not in running: %s", key)
            return

        if state == TaskInstanceState.RUNNING:
+            # The task process started, so any later failure is an execution failure that should
+            # not be requeued by the pre-execution path below.
+            self.last_known_jobs.pop(key, None)


These should probably also clean pod_launch_failure_max_retries?

I wonder if we should wrap the containers into an object so they are always handled together.
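
One possible shape for such a wrapper is sketched below. It assumes the executor keeps a per-key requeue counter next to `last_known_jobs`. The class name, the `requeue_counts` field and both methods are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Hashable


@dataclass
class PreExecutionState:
    """Keeps the stashed job and the requeue count for each task key together."""

    last_known_jobs: dict[Hashable, Any] = field(default_factory=dict)
    requeue_counts: dict[Hashable, int] = field(default_factory=dict)

    def record_requeue(self, key: Hashable, job: Any) -> int:
        """Stash the job, bump the counter, and return the new count."""
        self.last_known_jobs[key] = job
        self.requeue_counts[key] = self.requeue_counts.get(key, 0) + 1
        return self.requeue_counts[key]

    def forget(self, key: Hashable) -> None:
        """Drop everything tracked for key; safe to call for unknown keys."""
        self.last_known_jobs.pop(key, None)
        self.requeue_counts.pop(key, None)
```

The ADOPTED and RUNNING branches would then call `forget(key)` once instead of popping each container separately.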

uranusjr · 2026-06-26T23:05:32Z

We should add a release note for this, especially since it changes the default behavior: all pods are now retried once by default instead of failing directly.

We should probably also call out somewhere that if you set delete_worker_pods_on_failure = false (the default!), failed pods may accumulate when retries is set to -1 (unlimited). You can break the system quite badly with, say, a misconfigured image that crashes immediately on launch.

seanmuth requested review from hussein-awala, jedcunningham and jscheffl as code owners June 26, 2026 20:20

boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jun 26, 2026

uranusjr reviewed Jun 26, 2026

seanmuth wants to merge 2 commits into apache:main from seanmuth:feature/k8s-executor-pre-execution-requeue
