Requeue KubernetesExecutor tasks whose pod failed before execution started #69058

Open
seanmuth wants to merge 2 commits into apache:main from seanmuth:feature/k8s-executor-pre-execution-requeue
Conversation

@seanmuth

Contributor

When a worker pod is destroyed before the task process starts (a node drain, autoscaler scale-down, node boot race, or transient image pull failure), the task instance is still in the queued state and no task code has run. Today the KubernetesExecutor reports this to the scheduler as a normal FAILED, which consumes a user-configured task retry and raises a misleading failure alert for work that never executed.

This adds a transparent, executor-level requeue for that case. In _change_state, a pod that reports FAILED while its task instance is still QUEUED is requeued onto the existing task_queue (the same mechanism task_publish_max_retries already uses for pod creation failures) without writing to the event buffer, so the scheduler never observes the failure and no task-level retry is consumed.
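As a rough illustration of that routing, here is a toy model, not the provider code. `ToyExecutor`, its string states, and the `job` dict are hypothetical stand-ins; only the idea of "put the job back on the queue and skip the event buffer" comes from the description above.

```python
import queue


class ToyExecutor:
    """Hypothetical stand-in; this is not the KubernetesExecutor API."""

    def __init__(self):
        self.task_queue = queue.Queue()  # jobs waiting for a pod
        self.event_buffer = {}           # what the scheduler gets to see

    def change_state(self, key, pod_state, ti_state, job):
        if pod_state == "failed" and ti_state == "queued":
            # Pre-execution failure: requeue the job and report nothing.
            self.task_queue.put(job)
            return
        # Everything else is reported to the scheduler as usual.
        self.event_buffer[key] = pod_state


executor = ToyExecutor()
executor.change_state("ti-1", "failed", "queued", job={"key": "ti-1"})
executor.change_state("ti-2", "failed", "running", job={"key": "ti-2"})
```

In this sketch only `ti-2` ends up in `event_buffer`; `ti-1` sits on `task_queue` again and the scheduler never hears about its first pod.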

Behavior is bounded and configurable:

  • pod_launch_failure_retries (default 1; -1 means unlimited, 0 disables the requeue): how many times a task is transparently requeued before it fails normally.
  • pod_launch_failure_excluded_container_reasons (default Error): container reasons that opt out of the requeue path and consume a normal retry instead. The default excludes Error, which covers a container that started executing but whose worker process exited before writing running to the DB (most likely an Airflow-specific startup error rather than a transient infrastructure event).
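One way to read the three value ranges of pod_launch_failure_retries is the small helper below. It is only a sketch of the semantics listed above; `may_requeue` and `attempts_so_far` are names made up for this example and do not appear in the PR.

```python
def may_requeue(attempts_so_far: int, max_retries: int) -> bool:
    """Return True if one more transparent requeue is allowed.

    max_retries follows the semantics described above:
    0 disables the requeue, -1 means unlimited, N > 0 is a hard cap.
    """
    if max_retries == 0:
        return False  # feature disabled: fail normally
    if max_retries < 0:
        return True   # unlimited requeues
    return attempts_so_far < max_retries


first_failure_default = may_requeue(attempts_so_far=0, max_retries=1)
second_failure_default = may_requeue(attempts_so_far=1, max_retries=1)
```

With the default of 1, the first pre-execution failure is requeued and the second one falls through to the normal failure path.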

The ti_state == QUEUED check is the authoritative signal: a task that was actually executing would already have transitioned to running, so OOM kills and other mid-execution failures are unaffected. Deferrable-operator resume pods are covered by the same check: when the triggerer fires, the TI returns to queued, so a resume pod killed before execute_complete starts is requeued rather than failed, and the external work that already completed is not discarded.
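The decision itself can be sketched as a pure function. This is an assumption-laden simplification: the states are plain strings, `container_reason` is a single optional string, and the function is not the `_is_pre_execution_failure` method from the diff, whose real inputs may differ.

```python
from __future__ import annotations


def is_pre_execution_failure(
    pod_state: str,
    ti_state: str,
    container_reason: str | None,
    excluded_reasons: set[str],
) -> bool:
    """Hypothetical predicate: did the pod die before the task started?"""
    if pod_state != "failed":
        return False
    if ti_state != "queued":
        # The task reached running, so this is an execution failure.
        return False
    if container_reason is not None and container_reason in excluded_reasons:
        # Opted out, e.g. "Error": consume a normal retry instead.
        return False
    return True


excluded = {"Error"}
node_drain = is_pre_execution_failure("failed", "queued", None, excluded)
oom_kill = is_pre_execution_failure("failed", "running", "OOMKilled", excluded)
startup_error = is_pre_execution_failure("failed", "queued", "Error", excluded)
```

Only the node-drain case qualifies here; the OOM kill is ruled out by the TI state and the startup error by the excluded reason.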

closes: #69052


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.8)

Generated-by: Claude Code (Opus 4.8) following the guidelines

…arted

When a worker pod is destroyed before the task process starts (node drain, autoscaler scale-down, node boot race, transient image pull failure), the task instance is still queued and no task code has run. Reporting this to the scheduler as a normal failure consumes a user-configured retry and raises a misleading failure alert for work that never executed. The executor already has the signal to tell this apart from an execution failure, so it now transparently requeues the pod without consuming a task retry, bounded by pod_launch_failure_retries and excluding container reasons in pod_launch_failure_excluded_container_reasons (default Error).
@boring-cyborg boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jun 26, 2026
@seanmuth

Contributor Author

Working on a test confirmation in Astro Hosted now

The tests replaced executor.task_queue with a MagicMock, so the executor.end() teardown looped forever on get_nowait() instead of raising Empty, hitting the 60s CI timeout. Assert the requeue via observable executor state instead of mocking the queue, and pass a valid executor_config to the stash test so execute_async does not bail before recording the job.
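The hang is easy to reproduce in isolation. The `drain` helper below is a hypothetical teardown-style loop (capped so the demo terminates); it is not the code in executor.end(). What it relies on is standard unittest.mock behavior: calling any method on a MagicMock returns another mock and never raises unless a side_effect is configured.

```python
import queue
from unittest.mock import MagicMock


def drain(q, limit=5):
    """Drain a queue until Empty, giving up after `limit` items."""
    drained = 0
    while drained < limit:
        try:
            q.get_nowait()
        except queue.Empty:
            break
        drained += 1
    return drained


real = queue.Queue()
real.put("job")
real_count = drain(real)          # stops once the queue raises Empty

mocked = MagicMock()
mock_count = drain(mocked)        # never sees Empty, runs into the cap

patched = MagicMock()
patched.get_nowait.side_effect = queue.Empty
patched_count = drain(patched)    # raises Empty on the first call
```

Without the cap, the loop over `mocked` would spin forever, which matches the CI timeout described above. Asserting on observable executor state avoids the problem entirely; configuring `side_effect` is the alternative if a mock is still wanted.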
Comment on lines +501 to +506
if state == TaskInstanceState.FAILED and self._is_pre_execution_failure(
    state,
    self._get_task_instance_state(key, session=session),
    failure_details,
    self.pod_launch_failure_excluded_container_reasons,
):

Member


Suggested change
- if state == TaskInstanceState.FAILED and self._is_pre_execution_failure(
-     state,
-     self._get_task_instance_state(key, session=session),
-     failure_details,
-     self.pod_launch_failure_excluded_container_reasons,
- ):
+ if (
+     state == TaskInstanceState.FAILED
+     and self.pod_launch_failure_max_retries != 0
+     and self._is_pre_execution_failure(
+         state,
+         self._get_task_instance_state(key, session=session),
+         failure_details,
+         self.pod_launch_failure_excluded_container_reasons,
+     )
+ ):

This saves some db queries when no retries are allowed. Not a big deal, I think? (It shouldn’t be too common to explicitly disallow retries.)
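The saving comes from `and` short-circuiting: once the retries check is false, the argument expression containing the state lookup is never evaluated. A tiny stand-alone sketch, with `get_ti_state` as a hypothetical stand-in for the DB-backed lookup:

```python
calls = []


def get_ti_state(key):
    """Stand-in for a DB query; records every call it receives."""
    calls.append(key)
    return "queued"


def should_requeue(pod_state, max_retries, key):
    # Cheap checks first; the lookup only runs if both of them pass.
    return (
        pod_state == "failed"
        and max_retries != 0
        and get_ti_state(key) == "queued"
    )


disabled = should_requeue("failed", 0, "ti-1")
queries_when_disabled = len(calls)

enabled = should_requeue("failed", 1, "ti-1")
queries_when_enabled = len(calls)
```

With retries set to 0 the lookup is skipped completely; with retries enabled it runs once, as before.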

Comment on lines 471 to +484
if state == ADOPTED:
    # When the task pod is adopted by another executor,
    # then remove the task from the current executor running queue.
    self.last_known_jobs.pop(key, None)
    try:
        self.running.remove(key)
    except KeyError:
        self.log.debug("TI key not in running: %s", key)
    return

if state == TaskInstanceState.RUNNING:
    # The task process started, so any later failure is an execution failure that should
    # not be requeued by the pre-execution path below.
    self.last_known_jobs.pop(key, None)

Member


These should probably also clean pod_launch_failure_max_retries?

I wonder if we should wrap the containers into an object so they are always handled together.
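A possible shape for such a wrapper, purely as a sketch: `last_known_jobs` is the name used in the diff, while `PendingLaunches`, `requeue_counts`, and the method names are hypothetical and only illustrate keeping the per-key bookkeeping in one place so it cannot be cleaned up half-way.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class PendingLaunches:
    """Hypothetical holder for per-task launch bookkeeping."""

    last_known_jobs: dict[str, Any] = field(default_factory=dict)
    requeue_counts: dict[str, int] = field(default_factory=dict)

    def record(self, key: str, job: Any) -> None:
        """Remember the job so it can be requeued if its pod fails early."""
        self.last_known_jobs[key] = job
        self.requeue_counts.setdefault(key, 0)

    def bump(self, key: str) -> int:
        """Count one more requeue for this key and return the new total."""
        self.requeue_counts[key] = self.requeue_counts.get(key, 0) + 1
        return self.requeue_counts[key]

    def forget(self, key: str) -> None:
        """Drop everything known about the key in a single step."""
        self.last_known_jobs.pop(key, None)
        self.requeue_counts.pop(key, None)


pending = PendingLaunches()
pending.record("ti-1", {"key": "ti-1"})
attempts = pending.bump("ti-1")
pending.forget("ti-1")
```

The ADOPTED and RUNNING branches would then each call a single `forget(key)` instead of popping from several containers separately.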

@uranusjr

Member

We should add a release note for this, especially since it changes the default behavior: pods that fail before execution starts are now requeued once by default instead of failing directly.

We should probably also call out somewhere that with delete_worker_pods_on_failure = false (the default!), failed pods may accumulate if you set retries to -1 (unlimited). You can break the system quite badly with, say, a misconfigured image that crashes immediately on launch.

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Development

Successfully merging this pull request may close these issues.

KubernetesExecutor: automatically requeue tasks whose pod failed before execution started

2 participants