Add durable option to TriggerDagRunOperator to reconnect on retry by 1fanwang · Pull Request #68936 · apache/airflow

1fanwang · 2026-06-24T09:04:55Z

Why

When TriggerDagRunOperator runs with wait_for_completion=True (synchronous,
non-deferrable), a worker crash while it is polling turns into a duplicate.

On retry, the operator recomputes a fresh run_id (with no logical_date/
trigger_run_id it derives one from utcnow()), so the runner triggers a
second child run instead of reconnecting to the one the first attempt
already started — or, with a fixed run_id, the retry fails with
DagRunAlreadyExists. Either way a transient worker blip mid-wait either
duplicates the downstream work or fails a task whose triggered run is healthy
and still running.

This is the same duplicate-job-on-retry problem ResumableJobMixin
(durable=True, Airflow 3.3) solves for submit-and-poll operators. The mixin
targets operators whose execute() does the submit+poll; on Airflow 3 the
trigger/wait happens in the task runner (the operator raises
DagRunTriggerException), so the same persist-then-reconnect contract is
applied where the wait actually lives.

What

Add an opt-in durable flag to TriggerDagRunOperator (default False, no
behavior change). When durable=True and waiting synchronously:

the triggered run_id is persisted to task_state_store before polling
starts;
on retry the runner reads it back and:
- reconnects and resumes the wait if the run is still active;
- returns success without resubmitting if it already finished in an
  allowed state;
- triggers a fresh run if the prior one failed, is gone, or its state is
  unreadable.

Scoped to the synchronous wait; the deferrable path is unchanged and left for a
follow-up. Crash recovery is silently disabled when task_state_store is
unavailable (degrades to today's behavior).

Tests

Four unit tests covering the contract: persist-before-poll, reconnect to a
running prior run, short-circuit on an already-succeeded prior run, resubmit
after a failed prior run.
Existing TriggerDagRunOperator tests are untouched and pass — opt-in means
no behavior change on the default path.

End-to-end (live, Breeze built from `main`)

A parent TriggerDagRunOperator(wait_for_completion=True) triggers a child Dag
that sleeps; the parent's worker is SIGKILLed while it polls, and the scheduler
retries the task.

	child runs from one parent task	parent task on retry
`durable=False` (today)	2 — a duplicate child run	re-triggers a fresh run
`durable=True` (this PR)	1	reconnects to the in-flight run, succeeds

Raw before / after

BEFORE (durable=False) — worker killed mid-wait, scheduler retried (try_number=2):
  child dag runs:
    manual__2026-...T08:17:47Z  success   <- attempt 1
    manual__2026-...T08:19:46Z  success   <- attempt 2, DUPLICATE
  => 2 runs from one parent task

AFTER (durable=True) — same crash, try_number=2:
  attempt=2.log: "Reconnecting to run triggered on a prior attempt."
                 run_id=manual__2026-...T08:53:51Z
  parent run: success   (trigger_child succeeded on try_number=2)
  child dag runs:
    manual__2026-...T08:53:51Z  success   <- single run, reconnected
  => 1 run from one parent task

Risk

Opt-in, so the default path is unchanged. The new path adds two
task_state_store round-trips (one read, one write) per durable run, only when
durable=True. A run deleted between attempts is treated as "resubmit fresh".
The submit→persist gap is the same small window documented on ResumableJobMixin:
a crash between triggering and persisting falls back to a fresh trigger on retry.

Open question

Default is False to preserve behavior; ResumableJobMixin defaults durable=True.
Worth deciding whether this should eventually default on for synchronous waits.

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below)

Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
When adding dependency, check compliance with the ASF 3rd Party License Policy.
For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

With wait_for_completion the trigger-and-wait runs in the task runner. A worker crash while polling makes the retry recompute a fresh run_id and trigger a duplicate child run (or fail with DagRunAlreadyExists), even though the run the first attempt started is healthy and still running. The opt-in durable flag persists the triggered run_id to task_state_store before polling, so the retry reconnects to the in-flight run instead of resubmitting.

seanghaeli

Looks good, definitely a follow-up to handle deferrable=True will be useful

boring-cyborg Bot added area:providers area:task-sdk provider:standard labels Jun 24, 2026

1fanwang marked this pull request as ready for review June 24, 2026 18:38

1fanwang requested review from amoghrajesh, ashb and kaxil as code owners June 24, 2026 18:38

1fanwang added 4 commits June 24, 2026 11:43

Add newsfragment for TriggerDagRunOperator durable flag

ec3c791

Narrow task_state_store and coerce stored run id for mypy

c333423

Make the durable TriggerDagRunOperator newsfragment a single line

001febf

1fanwang force-pushed the durable-trigger-dagrun branch from eab9d05 to 001febf Compare June 24, 2026 18:43

This was referenced Jun 24, 2026

Make ResumableJobMixin's reconnect core reusable outside execute() #68952

Open

Reuse the resumable core for TriggerDagRunOperator's durable wait #68955

Open

seanghaeli approved these changes Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add durable option to TriggerDagRunOperator to reconnect on retry#68936

Add durable option to TriggerDagRunOperator to reconnect on retry#68936
1fanwang wants to merge 4 commits into
apache:mainfrom
1fanwang:durable-trigger-dagrun

1fanwang commented Jun 24, 2026 •

edited

Loading

Uh oh!

seanghaeli left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

1fanwang commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Tests

End-to-end (live, Breeze built from main)

Risk

Open question

Was generative AI tooling used to co-author this PR?

Uh oh!

seanghaeli left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1fanwang commented Jun 24, 2026 •

edited

Loading

End-to-end (live, Breeze built from `main`)