
Reconnect to the running Livy batch on retry instead of resubmitting #68956

Open
1fanwang wants to merge 1 commit into apache:main from 1fanwang:durable-livy


@1fanwang

Contributor

Why

When LivyOperator waits synchronously (deferrable=False with polling_interval > 0), the wait for
the Spark batch is tied to the worker process. If the worker is lost mid-poll, the retry posts a
brand-new Livy batch: the original Spark application keeps running and the work is duplicated.

ResumableJobMixin (Airflow 3.3, AIP-103) exists to make exactly this synchronous-wait path
crash-safe: persist the external job id before polling, and on retry reconnect to the running job
instead of resubmitting. Livy is a clean fit — a synchronous submit-then-poll operator, the same
shape the mixin was built for and the SparkSubmitOperator already uses.

What

LivyOperator now subclasses ResumableJobMixin and routes its synchronous-poll path through
execute_resumable:

  • submit_job posts the batch and returns its id; get_job_status / is_job_active /
    is_job_succeeded classify Livy BatchState; poll_until_complete reuses the existing
    poll_for_termination; get_job_result pushes the app_id XCom.
  • The batch id is persisted to task_state_store before polling, so a retry reads it back and
    reconnects to the running batch.
  • Deferrable (Triggerer owns the wait) and fire-and-forget (polling_interval=0, nothing
    to reconnect to) paths are untouched. An Airflow-2 stub keeps the provider importable on 2.x.
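
The persist-before-poll ordering described above can be sketched in isolation. This is a toy model, not the mixin's code: FakeLivy, WorkerLost, run_attempt and the plain dict standing in for task_state_store are all hypothetical names invented for illustration, and handling of a batch that already finished is left out for brevity.

```python
class WorkerLost(Exception):
    """Simulates the worker dying mid-poll (hypothetical, demo only)."""


class FakeLivy:
    """In-memory stand-in for a Livy server (hypothetical, demo only)."""

    def __init__(self):
        self.states = {}
        self.post_count = 0

    def post_batch(self):
        # Each call models one POST /batches and creates a new running batch.
        self.post_count += 1
        batch_id = self.post_count
        self.states[batch_id] = "running"
        return batch_id


def run_attempt(livy, state_store, poll):
    """One task attempt: submit only if no batch id was persisted earlier."""
    batch_id = state_store.get("batch_id")
    if batch_id is None:
        batch_id = livy.post_batch()
        # Persist BEFORE polling, so a crash during the wait leaves the id behind.
        state_store["batch_id"] = batch_id
    # On a retry this reconnects to the batch that is already running.
    return poll(batch_id)


def crashing_poll(batch_id):
    raise WorkerLost(f"worker lost while polling batch {batch_id}")


def make_finishing_poll(livy):
    def poll(batch_id):
        # Pretend the batch completes while this attempt is polling it.
        livy.states[batch_id] = "success"
        return livy.states[batch_id]

    return poll


def demo():
    livy = FakeLivy()
    store = {}  # a plain dict standing in for task_state_store
    try:
        run_attempt(livy, store, crashing_poll)  # attempt 1 dies mid-poll
    except WorkerLost:
        pass
    final_state = run_attempt(livy, store, make_finishing_poll(livy))  # attempt 2
    return livy.post_count, store["batch_id"], final_state
```

Calling demo() runs both attempts against the same store. Because the id is written before the first poll, the second attempt finds it and never calls post_batch again.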

Crash-safety is controlled by the mixin's durable flag, which is on by default; set durable=False
to keep the always-resubmit behaviour.

Tests

A new TestLivyOperatorResumable suite (gated on Airflow 3.3+) covers fresh-submit-persists-before-poll,
the three retry decisions (reconnect / return / resubmit) across real BatchState values, graceful
degradation without a task_state_store, and durable=False. The existing LivyOperator suite is
unchanged.
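
For readers who want the three retry decisions spelled out, here is a minimal sketch of one plausible mapping from batch state to decision. It is an assumption, not the mixin's actual logic: the BatchState enum below is a local copy of the Livy batch state names, and retry_decision is a hypothetical helper that treats a missing or failed batch as "resubmit".

```python
from enum import Enum
from typing import Optional


class BatchState(Enum):
    """Local copy of Livy batch state names, defined here to stay self-contained."""

    NOT_STARTED = "not_started"
    STARTING = "starting"
    RUNNING = "running"
    IDLE = "idle"
    BUSY = "busy"
    SHUTTING_DOWN = "shutting_down"
    ERROR = "error"
    DEAD = "dead"
    KILLED = "killed"
    SUCCESS = "success"


# States after which the batch will not change any more.
TERMINAL_STATES = {
    BatchState.ERROR,
    BatchState.DEAD,
    BatchState.KILLED,
    BatchState.SUCCESS,
}


def retry_decision(state: Optional[BatchState]) -> str:
    """Pick what a retry should do given the persisted batch's current state.

    `state` is None when no batch id was persisted or the batch is unknown.
    """
    if state is None:
        return "resubmit"  # nothing to reconnect to
    if state is BatchState.SUCCESS:
        return "return"  # work already done, do not run it again
    if state in TERMINAL_STATES:
        return "resubmit"  # previous batch failed or was killed
    return "reconnect"  # batch is still in flight
```

Under these assumptions, a RUNNING batch yields "reconnect", a SUCCESS batch yields "return", and DEAD, KILLED, ERROR or a missing batch yield "resubmit".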

End-to-end (live, Breeze)

The test triggers a real worker crash during the synchronous wait, against an in-memory Livy
stand-in that counts POST /batches requests. A LivyOperator(durable=True, deferrable=False,
polling_interval=3) submits a batch; the worker is SIGKILLed mid-poll; the scheduler retries.
Attempt 2 reads the persisted batch id back, reconnects to the still-running batch, and finishes
it with no second submit.

attempt 1: POST /batches -> batch id 1, worker polling      (POST count = 1)
SIGKILL worker (pid 406)
attempt 2 (try_number=2):
   "Reconnecting to existing job"
   "Batch with id 1 terminated with state: success"
task run_batch: success
POST /batches count after both attempts: 1   <- single submit, reconnected; no duplicate batch
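
A stand-in of the kind used above could look roughly like the following. This is not the harness from the run: LivyStandIn, request and demo are hypothetical names, and only POST /batches and GET /batches/{id}/state are modelled, with batches staying in the "running" state.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class LivyStandIn(BaseHTTPRequestHandler):
    """Tiny in-memory Livy look-alike that counts POST /batches (demo only)."""

    post_count = 0
    batches = {}

    def _send_json(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Drain the request body; the stand-in ignores its contents.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if self.path != "/batches":
            self._send_json(404, {"error": "not found"})
            return
        LivyStandIn.post_count += 1
        batch_id = LivyStandIn.post_count
        LivyStandIn.batches[batch_id] = "running"
        self._send_json(201, {"id": batch_id, "state": "running"})

    def do_GET(self):
        parts = self.path.strip("/").split("/")
        known = (
            len(parts) == 3
            and parts[0] == "batches"
            and parts[2] == "state"
            and parts[1].isdigit()
            and int(parts[1]) in LivyStandIn.batches
        )
        if not known:
            self._send_json(404, {"error": "not found"})
            return
        batch_id = int(parts[1])
        self._send_json(200, {"id": batch_id, "state": LivyStandIn.batches[batch_id]})

    def log_message(self, *args):
        pass  # keep the demo quiet


def request(port, method, path, body=None):
    """Send one JSON request to the local stand-in and decode the reply."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        payload = json.dumps(body) if body is not None else None
        headers = {"Content-Type": "application/json"} if payload else {}
        conn.request(method, path, body=payload, headers=headers)
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read())
    finally:
        conn.close()


def demo():
    LivyStandIn.post_count = 0
    LivyStandIn.batches = {}
    server = HTTPServer(("127.0.0.1", 0), LivyStandIn)  # port 0: pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # "Attempt 1" submits; "attempt 2" only polls the persisted id.
        _, created = request(port, "POST", "/batches", {"file": "job.py"})
        _, polled = request(port, "GET", f"/batches/{created['id']}/state")
        return LivyStandIn.post_count, created["id"], polled["state"]
    finally:
        server.shutdown()
        server.server_close()
```

The counter lives on the handler class, so it survives across requests and can be read after both attempts to confirm that only one submit happened.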

Risk

Only the synchronous-poll path changes; deferrable and fire-and-forget are byte-for-byte the same.
The reconnect logic is the shared mixin core, already covered by its own tests.


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.

When LivyOperator waits synchronously (deferrable=False with polling_interval > 0) and the
worker is lost mid-poll, the retry currently posts a brand-new Livy batch, leaving the original
Spark application running and duplicating the work. Subclass ResumableJobMixin so the batch id is
persisted before polling and the retry reconnects to the in-flight batch. Deferrable and
fire-and-forget (polling_interval=0) paths are unchanged.
@potiuk added the "ready for maintainer review" label on Jun 25, 2026