Add durable option to TriggerDagRunOperator to reconnect on retry #68936
Open
1fanwang wants to merge 4 commits into
With wait_for_completion the trigger-and-wait runs in the task runner. A worker crash while polling makes the retry recompute a fresh run_id and trigger a duplicate child run (or fail with DagRunAlreadyExists), even though the run the first attempt started is healthy and still running. The opt-in durable flag persists the triggered run_id to task_state_store before polling, so the retry reconnects to the in-flight run instead of resubmitting.
seanghaeli
approved these changes
Jun 24, 2026
seanghaeli
left a comment
Contributor
Looks good, definitely a follow-up to handle deferrable=True will be useful
Why
When TriggerDagRunOperator runs with wait_for_completion=True (synchronous, non-deferrable), a worker crash while it is polling turns into a duplicate child run.
On retry, the operator recomputes a fresh run_id (with no logical_date / trigger_run_id it derives one from utcnow()), so the runner triggers a second child run instead of reconnecting to the one the first attempt already started. With a fixed run_id, the retry instead fails with DagRunAlreadyExists.
In both cases a transient worker blip mid-wait either duplicates the downstream work or fails a task whose triggered run is healthy and still running.
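The duplicate follows directly from deriving the run_id from the current time. A minimal sketch, assuming a hypothetical derive_run_id helper and an illustrative "manual__" prefix (neither is taken from the operator's code, and the timestamps are made up):

```python
from datetime import datetime, timedelta, timezone


def derive_run_id(now: datetime) -> str:
    # Hypothetical helper: stands in for "derive a run_id from utcnow()".
    # The "manual__" prefix is illustrative only.
    return f"manual__{now.isoformat()}"


# Illustrative timestamps: the first attempt, and a retry five minutes later.
first_attempt = datetime(2026, 6, 24, 12, 0, 0, tzinfo=timezone.utc)
retry = first_attempt + timedelta(minutes=5)

first_run_id = derive_run_id(first_attempt)
retry_run_id = derive_run_id(retry)

# The retry computes a different run_id, so triggering with it creates a
# second child run rather than finding the first one.
ids_differ = first_run_id != retry_run_id
```

With a fixed run_id the two values would be equal instead, which is the case where the retry collides with the existing run.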
This is the same duplicate-job-on-retry problem ResumableJobMixin (durable=True, Airflow 3.3) solves for submit-and-poll operators. The mixin targets operators whose execute() does the submit+poll; on Airflow 3 the trigger/wait happens in the task runner (the operator raises DagRunTriggerException), so the same persist-then-reconnect contract is applied where the wait actually lives.
What
Add an opt-in durable flag to TriggerDagRunOperator (default False, no behavior change). When durable=True and waiting synchronously:
run_id is persisted to task_state_store before polling starts;
allowed state;
unreadable.
Scoped to the synchronous wait; the deferrable path is unchanged and left for a follow-up. Crash recovery is silently disabled when task_state_store is unavailable (degrades to today's behavior).
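The retry-time decision described in this PR can be sketched roughly as below. This is not the PR's implementation: plan_retry, persist_run_id, the dict-based store, and the state names are all assumptions made for illustration, and the real task_state_store API is not shown here.

```python
from typing import Callable, Optional

ALLOWED_STATES = {"success"}  # assumption: states that count as completed
FAILED_STATES = {"failed"}    # assumption: states that count as failed


def persist_run_id(store: Optional[dict], task_key: str, run_id: str) -> None:
    """Record the triggered run_id before polling starts (no-op without a store)."""
    if store is not None:
        store[task_key] = run_id


def plan_retry(
    store: Optional[dict],
    task_key: str,
    get_run_state: Callable[[str], Optional[str]],
) -> tuple:
    """Return (action, run_id) for an attempt.

    action is "resubmit", "reconnect" or "short_circuit".
    """
    if store is None:
        # State store unavailable: degrade to today's behavior.
        return ("resubmit", None)
    prior_run_id = store.get(task_key)
    if prior_run_id is None:
        # First attempt, or the earlier attempt crashed before persisting.
        return ("resubmit", None)
    state = get_run_state(prior_run_id)
    if state is None or state in FAILED_STATES:
        # Run deleted between attempts, or it failed: trigger a fresh run.
        return ("resubmit", None)
    if state in ALLOWED_STATES:
        # The prior run already finished in an allowed state.
        return ("short_circuit", prior_run_id)
    # The prior run is still in flight: resume polling it.
    return ("reconnect", prior_run_id)
```

A usage example with an in-memory dict standing in for both the store and the child run table: after `persist_run_id(store, "t", "run_1")` and `runs["run_1"] = "running"`, `plan_retry(store, "t", runs.get)` picks the reconnect branch instead of triggering again.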
Tests
running prior run, short-circuit on an already-succeeded prior run, resubmit
after a failed prior run.
TriggerDagRunOperator tests are untouched and pass; opt-in means no behavior change on the default path.
End-to-end (live, Breeze built from main)
A parent TriggerDagRunOperator(wait_for_completion=True) triggers a child Dag that sleeps; the parent's worker is SIGKILLed while it polls, and the scheduler retries the task.
durable=False (today) | durable=True (this PR)
Raw before / after
Risk
Opt-in, so the default path is unchanged. The new path adds two task_state_store round-trips (one read, one write) per durable run, only when durable=True. A run deleted between attempts is treated as "resubmit fresh". The submit→persist gap is the same small window documented on ResumableJobMixin: a crash between triggering and persisting falls back to a fresh trigger on retry.
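The submit→persist window can be illustrated with a small simulation. Everything here is hypothetical (the attempt function, the WorkerCrash exception, the "run_id" store key, and the crash_at switch); it only models the ordering trigger, then persist, then poll.

```python
class WorkerCrash(Exception):
    """Stands in for the worker process being killed."""


def attempt(store: dict, triggered: list, new_run_id: str, crash_at: str = "") -> str:
    """One task attempt: reconnect if a run_id was persisted, else trigger and persist."""
    prior = store.get("run_id")
    if prior is not None:
        return prior                  # reconnect to the in-flight run
    triggered.append(new_run_id)      # submit: the child run now exists
    if crash_at == "after_trigger":
        raise WorkerCrash()           # crash inside the submit→persist gap
    store["run_id"] = new_run_id      # persist before polling
    if crash_at == "while_polling":
        raise WorkerCrash()           # crash mid-wait, after persisting
    return new_run_id


def run_with_one_retry(crash_at: str) -> list:
    """Run a first attempt that may crash, then one retry; return the triggered run_ids."""
    store, triggered = {}, []
    try:
        attempt(store, triggered, "run_a", crash_at)
    except WorkerCrash:
        attempt(store, triggered, "run_b")
    return triggered
```

In this model a crash while polling leaves a single triggered run after the retry, while a crash between triggering and persisting still produces a second run, which is the residual window described above.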
Open question
Default is False to preserve behavior; ResumableJobMixin defaults durable=True. Worth deciding whether this should eventually default on for synchronous waits.