Add Execution API endpoints for callback lifecycle and token swap #66141

Open
wjddn279 wants to merge 4 commits into apache:main from wjddn279:add-callback-lifecycle-api

Conversation

@wjddn279
Contributor

@wjddn279 wjddn279 commented Apr 30, 2026

Problems

1. Workload token leaks into callback supervisor's runtime API calls

Related to #60108. When the executor enqueues an ExecuteCallback, the supervisor receives a long-lived workload-scoped token (intended only as a one-time admission ticket, as for ExecuteTask). However, unlike ExecuteTask, which immediately swaps it for an execution-scoped token via PATCH /task-instances/{id}/run, supervise_callback never performs the swap. It carries the workload token straight through into runtime API calls (e.g. Connection.get(), Variable.get()).

2026-04-29T11:08:35.766821Z [warning  ] Token type not allowed for endpoint [airflow.api_fastapi.execution_api.security] allowed_types=['execution'] correlation_id=019dd8ed-2e34-7b0d-bfba-d7fe16612411 loc=security.py:174 path=/execution/connections/slack_default token_scope=workload
2026-04-29T11:08:35.767466Z [info     ] request finished               [http.access] client_addr=172.18.0.7:47640 duration_us=1634 loc=http_access_log.py:98 method=GET path=/execution/connections/slack_default query= status_code=403

These shared endpoints only accept execution-scoped tokens, so any callback that touches Connections or Variables hits 403 from require_auth's scope check.
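
For illustration, the scope check behind that 403 can be sketched as follows. This is a minimal sketch, not Airflow's require_auth: Token, Forbidden, require_scope and get_connection are hypothetical names, and only the two scope values (workload, execution) come from the log above.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Stands in for an HTTP 403 response (hypothetical)."""


@dataclass(frozen=True)
class Token:
    subject: str
    scope: str  # "workload" or "execution"


EXECUTION_ONLY = frozenset({"execution"})


def require_scope(token: Token, allowed: frozenset) -> None:
    """Reject any token whose scope is not in the endpoint's allow-list."""
    if token.scope not in allowed:
        raise Forbidden(
            f"token_scope={token.scope} allowed_types={sorted(allowed)}"
        )


def get_connection(token: Token, conn_id: str) -> dict:
    """Hypothetical shared endpoint that only accepts execution tokens."""
    require_scope(token, EXECUTION_ONLY)
    return {"conn_id": conn_id}


# A callback supervisor that never swapped its token is rejected ...
try:
    get_connection(Token("cb-1", "workload"), "slack_default")
    workload_rejected = False
except Forbidden:
    workload_rejected = True

# ... while the same call with an execution token goes through.
conn = get_connection(Token("cb-1", "execution"), "slack_default")
```

Under these assumptions, the callback fails at the first Connection or Variable lookup rather than at admission, which matches the log lines above.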

2. process_executor_events is the only writer of callback state, and it's executor-specific

Today, callback state transitions (QUEUED → RUNNING → SUCCESS/FAILED) are written only by the scheduler's _process_executor_events, which reads from each executor's in-process event_buffer. This does not work for executors that don't use the scheduler's in-process event channel, most notably EdgeExecutor, where workers run on remote machines and report back via HTTP. Their callbacks stay in QUEUED forever.

Solution

Move callback state transitions to the Execution API, mirroring the ExecuteTask pattern:

  • POST /callbacks/{id}/run: accepts both workload and execution tokens; transitions QUEUED → RUNNING and returns a fresh execution token via the Refreshed-API-Token header (the same swap mechanism that PR #60108, "Two-token mechanism for task execution to prevent token expiration while tasks wait in executor queues", introduced for task instances). All subsequent callback API calls use the execution token.
  • PATCH /callbacks/{id}/state: execution-only; records the terminal SUCCESS / FAILED outcome with optional output.
  • supervise_callback calls client.callbacks.start() before forking the subprocess and client.callbacks.finish() afterward, so the API is the single source of truth for callback state. This works uniformly across all executors (Local, Celery, Kubernetes, Edge).
  • _process_executor_events keeps its callback handling as a forward-only fallback for cases where the supervisor crashes before reporting a terminal state; it never regresses a state already advanced by the API.
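
The forward-only rule in the last bullet can be sketched as a rank comparison. This is a minimal sketch under assumed names (RANK and advance are hypothetical, and the state strings are placeholders), not the code in this PR: a write is applied only if it moves the callback strictly forward, so a late executor event cannot overwrite a state the API already recorded.

```python
# Hypothetical ordering: both terminal states share the highest rank,
# so one terminal state can never replace the other.
RANK = {"QUEUED": 0, "RUNNING": 1, "SUCCESS": 2, "FAILED": 2}


def advance(current: str, proposed: str) -> str:
    """Return the state to store: `proposed` only if it is strictly ahead."""
    return proposed if RANK[proposed] > RANK[current] else current


# The API path records RUNNING and then SUCCESS.
state = "QUEUED"
state = advance(state, "RUNNING")
state = advance(state, "SUCCESS")

# A stale executor event that arrives later is ignored by the fallback.
after_stale_event = advance(state, "RUNNING")

# If the supervisor crashed while RUNNING, the fallback can still record FAILED.
after_crash = advance("RUNNING", "FAILED")
```

Giving SUCCESS and FAILED the same rank is a design choice of this sketch: whichever writer reports a terminal state first wins.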

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    Claude Code (Opus 4.7), to write test code and get CI green

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

@boring-cyborg boring-cyborg Bot added area:API Airflow's REST/HTTP API area:Scheduler including HA (high availability) scheduler area:task-sdk labels Apr 30, 2026
@anishgirianish
Contributor

anishgirianish commented Apr 30, 2026

@wjddn279 Hey! Confirmed the regression is real and on me: generate_token flipped to workload scope at the base class, but supervise_callback never got the paired swap. Thanks for catching it.

Quick thought before you go deeper on the PR: this is now the third place we'd be hand-rolling the workload→execution swap (tasks /run, your callback /run, and #62343's connection-test PATCH). Wonder if it's worth lifting it into a generic POST /auth/exchange + a transparent swap in the SDK client, so resource endpoints stay execution-only and future workload types inherit the fix automatically. Happy to chat / draft if it sounds reasonable.

Would love to hear @ashb @amoghrajesh @kaxil thoughts on this as well

@wjddn279
Contributor Author

@anishgirianish

So you're suggesting we split out just the token-exchange API into its own endpoint? Considering that workload types will likely grow, that sounds like a reasonable idea. But right now, we're tightening the security of the token swap by strictly restricting when the API can succeed, right? (e.g., only allowing it to succeed on the queued → running state transition.) Would that still be possible if we generalize it?

@anishgirianish
Contributor

anishgirianish commented Apr 30, 2026

Yeah, thinking about it again, you're right: the QUEUED → RUNNING gate gives us single-use structurally, and that's worth keeping. I think the abstraction is probably not worth it for now. Thanks for pointing that out.
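
To make the "single-use structurally" point concrete, here is a minimal sketch of a swap that is gated on the QUEUED → RUNNING transition. CallbackStore, SwapRejected and the token format are all hypothetical, and rejecting a repeated call is an assumption of this sketch rather than the confirmed behavior of the PR's /run endpoint.

```python
import secrets


class SwapRejected(Exception):
    """Stands in for whatever error response the endpoint returns (hypothetical)."""


class CallbackStore:
    """In-memory stand-in for the callback table."""

    def __init__(self) -> None:
        self.states: dict[str, str] = {}

    def run(self, callback_id: str) -> str:
        """Swap a workload token for an execution token, once per callback."""
        # A real implementation would need an atomic compare-and-set
        # (for example a conditional UPDATE); this in-memory check is not atomic.
        if self.states.get(callback_id) != "QUEUED":
            raise SwapRejected(f"{callback_id} is not QUEUED")
        self.states[callback_id] = "RUNNING"
        # Placeholder value, not a real signed token.
        return "execution-" + secrets.token_hex(8)


store = CallbackStore()
store.states["cb-1"] = "QUEUED"

first_token = store.run("cb-1")  # succeeds and moves the callback to RUNNING

try:
    store.run("cb-1")  # the gate is already closed
    second_attempt_rejected = False
except SwapRejected:
    second_attempt_rejected = True
```

Because the gate is the state transition itself, no separate "token already used" bookkeeping is needed, which is the property a generic exchange endpoint would have to recreate.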

Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/routes/callbacks.py Outdated
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/routes/callbacks.py Outdated
Comment thread task-sdk/src/airflow/sdk/api/datamodels/_generated.py
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
Comment thread task-sdk/tests/task_sdk/execution_time/test_callback_supervisor.py Outdated
@wjddn279 wjddn279 force-pushed the add-callback-lifecycle-api branch from 7709b17 to 4af312d Compare May 5, 2026 13:22
@potiuk
Member

potiuk commented May 5, 2026

@wjddn279 — Your unresolved review thread(s) from @amoghrajesh appear to have been addressed (post-review commits and/or in-thread replies on every thread, with the latest commit pushed after the most recent thread). I've added the ready for maintainer review label so the PR re-enters the maintainer review queue.

@amoghrajesh — could you take another look when you have a chance? If you agree the feedback was addressed, please mark the threads as resolved so the queue signal stays accurate. If a thread still needs work, please reply in-line — @wjddn279 will follow up.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label May 5, 2026
@wjddn279 wjddn279 force-pushed the add-callback-lifecycle-api branch 2 times, most recently from f4f09ae to 1325cbb Compare May 7, 2026 00:05
@wjddn279 wjddn279 force-pushed the add-callback-lifecycle-api branch from 1325cbb to 137b659 Compare May 7, 2026 02:43

Labels

area:API Airflow's REST/HTTP API area:Scheduler including HA (high availability) scheduler area:task-sdk ready for maintainer review Set after triaging when all criteria pass.

4 participants