Requeue KubernetesExecutor tasks whose pod failed before execution started #69058
seanmuth wants to merge 2 commits into
…arted When a worker pod is destroyed before the task process starts (node drain, autoscaler scale-down, node boot race, transient image pull failure), the task instance is still queued and no task code has run. Reporting this to the scheduler as a normal failure consumes a user-configured retry and raises a misleading failure alert for work that never executed. The executor already has the signal to tell this apart from an execution failure, so it now transparently requeues the pod without consuming a task retry, bounded by pod_launch_failure_retries and excluding container reasons in pod_launch_failure_excluded_container_reasons (default Error).
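The distinction drawn in this commit message can be sketched as a small predicate. This is only an illustration under stated assumptions, not the PR's code: the function name, the plain-string states, and the `container_reason` argument are hypothetical stand-ins for whatever the executor actually passes around.

```python
from typing import FrozenSet, Optional

QUEUED = "queued"
FAILED = "failed"


def is_pre_execution_failure(
    pod_state: str,
    ti_state: Optional[str],
    container_reason: Optional[str],
    excluded_reasons: FrozenSet[str] = frozenset({"Error"}),
) -> bool:
    """Return True when a failed pod most likely never ran any task code."""
    if pod_state != FAILED:
        return False
    # A task that actually started would already have left the queued state.
    if ti_state != QUEUED:
        return False
    # Excluded container reasons take the normal failure path instead.
    return container_reason not in excluded_reasons
```

Under these assumptions, a failed pod whose task instance is still queued qualifies for a requeue unless its container reason is in the excluded set, and anything that reached `running` never does.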
Working on a test confirmation in Astro Hosted now.
The tests replaced executor.task_queue with a MagicMock, so the executor.end() teardown looped forever on get_nowait() instead of raising Empty, hitting the 60s CI timeout. Assert the requeue via observable executor state instead of mocking the queue, and pass a valid executor_config to the stash test so execute_async does not bail before recording the job.
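The hang described in this commit message is easy to reproduce in isolation. The sketch below is not the executor's `end()` implementation: `drain` is a hypothetical stand-in for a teardown loop, and the iteration cap exists only so that the example terminates.

```python
import queue
from unittest.mock import MagicMock


def drain(q, max_iterations=1000):
    """Drain q like a teardown loop might; return (items_seen, terminated)."""
    seen = 0
    for _ in range(max_iterations):
        try:
            q.get_nowait()
        except queue.Empty:
            return seen, True
        seen += 1
    return seen, False  # gave up: the queue never reported Empty


real = queue.Queue()
real.put("task")
print(drain(real))         # (1, True)
print(drain(MagicMock()))  # (1000, False)
```

A `MagicMock` returns another mock from `get_nowait()` rather than raising `queue.Empty`, so a loop that relies on that exception to stop never sees it. Asserting on observable executor state avoids the problem because the real queue is left in place.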
    if state == TaskInstanceState.FAILED and self._is_pre_execution_failure(
        state,
        self._get_task_instance_state(key, session=session),
        failure_details,
        self.pod_launch_failure_excluded_container_reasons,
    ):
Suggested change:

    - if state == TaskInstanceState.FAILED and self._is_pre_execution_failure(
    -     state,
    -     self._get_task_instance_state(key, session=session),
    -     failure_details,
    -     self.pod_launch_failure_excluded_container_reasons,
    - ):
    + if (
    +     state == TaskInstanceState.FAILED
    +     and self.pod_launch_failure_max_retries != 0
    +     and self._is_pre_execution_failure(
    +         state,
    +         self._get_task_instance_state(key, session=session),
    +         failure_details,
    +         self.pod_launch_failure_excluded_container_reasons,
    +     )
    + ):
This saves some db queries when no retries are allowed. Not a big deal, I think? (It shouldn’t be too common to explicitly disallow retries.)
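The saving comes from Python's short-circuit evaluation of `and`: once an earlier operand is falsy, later operands (and the arguments to their calls) are never evaluated. A minimal sketch, with a hypothetical `CountingLookup` class standing in for the DB-backed task instance state lookup:

```python
class CountingLookup:
    """Stand-in for a DB-backed state lookup that counts how often it is hit."""

    def __init__(self):
        self.calls = 0

    def ti_state(self, key):
        self.calls += 1
        return "queued"


def should_requeue(state, max_retries, lookup, key):
    # The lookup is the last operand, so it runs only if both guards pass.
    return (
        state == "failed"
        and max_retries != 0
        and lookup.ti_state(key) == "queued"
    )


lookup = CountingLookup()
should_requeue("failed", 0, lookup, "ti-1")
print(lookup.calls)  # 0: the lookup was skipped
should_requeue("failed", 1, lookup, "ti-1")
print(lookup.calls)  # 1: the lookup ran once
```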
    if state == ADOPTED:
        # When the task pod is adopted by another executor,
        # then remove the task from the current executor running queue.
        self.last_known_jobs.pop(key, None)
        try:
            self.running.remove(key)
        except KeyError:
            self.log.debug("TI key not in running: %s", key)
        return

    if state == TaskInstanceState.RUNNING:
        # The task process started, so any later failure is an execution failure that should
        # not be requeued by the pre-execution path below.
        self.last_known_jobs.pop(key, None)
These should probably also clean pod_launch_failure_max_retries?
I wonder if we should wrap the containers into an object so they are always handled together.
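One possible shape for such a wrapper, purely as a sketch: the class name, the `launch_failure_counts` mapping, and every method below are hypothetical and do not come from the PR. The point is only that a single `forget` call clears all per-task bookkeeping at once, so no call site can clean one container and miss another.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Hashable


@dataclass
class PodLaunchBookkeeping:
    """Hypothetical holder for per-task state that must be cleared together."""

    last_known_jobs: Dict[Hashable, Any] = field(default_factory=dict)
    launch_failure_counts: Dict[Hashable, int] = field(default_factory=dict)

    def remember_job(self, key: Hashable, job: Any) -> None:
        self.last_known_jobs[key] = job

    def record_launch_failure(self, key: Hashable) -> int:
        """Increment and return the number of launch failures seen for key."""
        self.launch_failure_counts[key] = self.launch_failure_counts.get(key, 0) + 1
        return self.launch_failure_counts[key]

    def forget(self, key: Hashable) -> None:
        """Drop every piece of bookkeeping for key in one call."""
        self.last_known_jobs.pop(key, None)
        self.launch_failure_counts.pop(key, None)
```

With something like this, the `ADOPTED` and `RUNNING` branches would each call `forget(key)` instead of popping from individual dictionaries.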
We should add a release note for this, especially since it changes the default behavior: all pods are now retried once by default instead of failing directly. We should probably also call out somewhere that if you set …
When a worker pod is destroyed before the task process starts (a node drain, autoscaler scale-down, node boot race, or transient image pull failure), the task instance is still in `queued` state and no task code has run. Today the KubernetesExecutor reports this to the scheduler as a normal `FAILED`, which consumes a user-configured task retry and raises a misleading failure alert for work that never executed.

This adds a transparent, executor-level requeue for that case. In `_change_state`, a pod that reports `FAILED` while its task instance is still `QUEUED` is requeued onto the existing `task_queue` (the same mechanism `task_publish_max_retries` already uses for pod creation failures) without writing to the event buffer, so the scheduler never observes the failure and no task-level retry is consumed.

Behavior is bounded and configurable:

- `pod_launch_failure_retries` (default `1`, `-1` unlimited, `0` disables): how many times a task is transparently requeued before failing normally.
- `pod_launch_failure_excluded_container_reasons` (default `Error`): container reasons that opt out of the requeue path and consume a normal retry instead. The default excludes `Error`, which covers a container that started executing but whose worker process exited before writing `running` to the DB (most likely an Airflow-specific startup error rather than a transient infrastructure event).

The `ti_state == QUEUED` check is the authoritative signal: a task that was actually executing would already have transitioned to `running`, so OOM-kills and other mid-execution failures are unaffected. Deferrable-operator resume pods are covered for free: when the triggerer fires, the TI returns to `queued`, so a resume pod killed before `execute_complete` starts is requeued rather than discarding already-completed external work.

closes: #69052
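The documented semantics of `pod_launch_failure_retries` (default `1`, `-1` unlimited, `0` disables) can be expressed as a small check. This is a sketch under assumptions rather than the PR's implementation: `may_requeue` and `attempts_so_far` are hypothetical names, and treating negative values other than `-1` as "do not requeue" is a choice made here, not something the description specifies.

```python
def may_requeue(attempts_so_far: int, max_retries: int = 1) -> bool:
    """Decide whether one more transparent requeue is allowed."""
    if max_retries == 0:
        return False  # feature disabled
    if max_retries == -1:
        return True  # unlimited requeues
    # Assumption: any other negative value never satisfies this comparison.
    return attempts_so_far < max_retries


print(may_requeue(0))      # True: the default allows one requeue
print(may_requeue(1))      # False: the single requeue is used up
print(may_requeue(7, -1))  # True: unlimited
print(may_requeue(0, 0))   # False: disabled
```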
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.8) following the guidelines