AIP-103: Adding periodic task state garbage collection and retention support #66463
amoghrajesh wants to merge 2 commits into apache:main
Conversation
| default: "airflow.state.metastore.MetastoreStateBackend" | ||
| default_retention_days: | ||
| description: | | ||
| Number of days to retain task state rows after their last update. |
default_retention_days only applies to task_state rows. asset_state has no time-based GC — it is cleared explicitly (as per the AIP) or via orphan cleanup when the asset is deregistered. Updated the config description to make this clear.
    Reads ``[state_store] default_retention_days`` from config and delegates to the
    configured state backend. Runs on the interval set by ``[state_store] state_cleanup_interval``.
    """
    retention_days = conf.getint("state_store", "default_retention_days")
if default_retention_days is set to 0, then we'll delete everything older than now; it doesn't look right 🤔
Oops, fixed it! retention_days=0 now disables time-based cleanup entirely. Refactored so MetastoreStateBackend.cleanup() reads default_retention_days from config directly rather than receiving it as an argument; that felt cleaner since the backend is responsible for enforcing its own retention policy. The scheduler just calls cleanup() with no args.
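For illustration, a minimal sketch of that guard (the helper name is hypothetical, not from this PR):

from datetime import datetime, timedelta, timezone


def _retention_cutoff(retention_days: int) -> datetime | None:
    """Return the updated_at cutoff, or None when time-based cleanup is disabled."""
    if retention_days <= 0:
        # default_retention_days = 0 (or negative) means: never delete based on age.
        return None
    return datetime.now(timezone.utc) - timedelta(days=retention_days)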
jason810496
left a comment
Would it be better to introduce batching / pagination for the task state garbage collection?
older_than = now - timedelta(days=retention_days) if retention_days > 0 else None
with create_session() as session:
    if older_than:
        session.execute(delete(TaskStateModel).where(TaskStateModel.updated_at < older_than))
    session.execute(
        delete(TaskStateModel).where(
            TaskStateModel.expires_at.isnot(None),
            TaskStateModel.expires_at < now,
        )
    )
    active_asset_ids = select(AssetModel.id).join(
        AssetActive, (AssetActive.name == AssetModel.name) & (AssetActive.uri == AssetModel.uri)
    )
    session.execute(delete(AssetStateModel).where(AssetStateModel.asset_id.not_in(active_asset_ids)))
If it's a valid assumption that users might produce a large number of state records within the state_cleanup_interval window, then I have some concerns regarding the single-pass delete transaction here.
I just double checked the concern with Claude:
Yes, this is a real problem. Several compounding issues:
- Missing indexes — every cleanup will be a full table scan
The two predicates the cleanup filters on are not indexed:
- task_state.updated_at — no index (only task_state_pkey on (dag_run_id, task_id, map_index, key) and idx_task_state_lookup on (dag_id, run_id, task_id, map_index))
- task_state.expires_at — no index (just added in this PR)
So both DELETE WHERE updated_at < cutoff and DELETE WHERE expires_at < now()
do full sequential scans. On a deployment with millions of rows that's minutes
of scanning every 24h, plus the locks held for the whole duration.
- No batching / no LIMIT
Compare to airflow db cleanup (utils/db_cleanup.py:217), which deletes in
configurable batches and commits between them. The new path runs three plain
bulk DELETEs in a single session. Long-running bulk DELETE means:
- Row locks held for the duration (writers calling task_state.set() upserts on matching rows block — they queue behind the cleanup transaction).
- On Postgres: massive WAL churn, autovacuum can't keep up, table bloat.
- On MySQL/InnoDB at REPEATABLE READ (Airflow's default): next-key/gap locks
make conflicts even more likely.
- All three DELETEs share one transaction
with create_session() as session: opens one session; each session.execute()
runs inside it; commit happens at exit. If pass 1 takes 90s, the locks from
pass 1 are held while pass 2 and pass 3 run. A failure in pass 3 rolls back
passes 1 and 2 (cleanup makes no forward progress at all).
- Scheduler main loop is blocked
_cleanup_expired_task_state is registered via call_regular_interval, which is
synchronous in the scheduler loop. Same pattern as
_remove_unreferenced_triggers and _update_asset_orphanage — but those have
small cardinality. task_state is user-driven and unbounded (the AIP encourages
users to write a lot of it). With a multi-minute cleanup the scheduler is not
scheduling for those minutes.
Thanks, good catches.
I will address all of them except the last one, since it no longer applies from the scheduler perspective: it's a CLI command now.
Force-pushed from 082d92d to 7dc826d
Force-pushed from 7dc826d to b644ce6
STATE_STORE_COMMANDS = (
    ActionCommand(
        name="cleanup",
        help="Remove expired task state rows via the configured state backend",
Nit:
-        help="Remove expired task state rows via the configured state backend",
+        help="Remove expired stored state via the configured state backend",
# even if updated_at is recent. NULL means no early expiry — the row is still cleaned
# up by the global updated_at + default_retention_days check. Populated via
This is confusing IMO -- an expires_at of None should mean it never expires.
We can pre-compute the expires_at value at update time by reading the default_retention config then (i.e. cleanup becomes a simpler "SELECT where expires_at < Now()").
This possibly also removes the need for an index on updated_at.
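A rough sketch of that suggestion, assuming a hypothetical helper on the write path; only conf.getint and the [state_store] option name come from this PR:

from datetime import datetime, timedelta, timezone

from airflow.configuration import conf


def _compute_expires_at(retention_days: int | None = None) -> datetime:
    """Pre-compute expires_at at set() time so cleanup is a single expires_at < now() delete."""
    days = retention_days if retention_days is not None else conf.getint("state_store", "default_retention_days")
    return datetime.now(timezone.utc) + timedelta(days=days)

A real implementation would still have to decide what default_retention_days = 0 maps to in this scheme (e.g. a NULL meaning "never expires", as suggested above).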
pk_cols = (
    TaskStateModel.dag_run_id,
    TaskStateModel.task_id,
    TaskStateModel.map_index,
    TaskStateModel.key,
)
Given how this is used, it might be time to add a single-column id pk (either integer or uuid)
def _delete_batched(where_clause) -> int:
    total = 0
    while True:
        with create_session() as session:
I don't think this should be a new session object each time around the loop, but instead one session object that is explicitly session.commit()ed after each batch.
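A sketch of how that could look, combined with the single-column id pk suggested above; the id column, helper signature, and batch size are assumptions, not code from this PR:

from sqlalchemy import delete, select


def _delete_batched(session, where_clause, batch_size: int = 5000) -> int:
    """Delete matching task_state rows in chunks, committing after each batch."""
    total = 0
    while True:
        # Assumes the single-column surrogate pk (TaskStateModel.id) proposed above.
        ids = session.execute(
            select(TaskStateModel.id).where(where_clause).limit(batch_size)
        ).scalars().all()
        if not ids:
            break
        session.execute(delete(TaskStateModel).where(TaskStateModel.id.in_(ids)))
        session.commit()  # release locks and keep forward progress between batches
        total += len(ids)
    return total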
closes: #66459
What?
Task state rows live as long as their parent DAG run. In deployments that don't run airflow db cleanup — or where task state should expire sooner than the DAG run — rows accumulate indefinitely. This PR adds an explicit retention mechanism independent of DAG run cleanup. To perform effective cleanup, the following is needed:

- task_state rows older than N days need to be removed
- when an asset is deregistered, its asset_active entry is deleted, but asset_state rows stay behind silently

Proposed change

- expires_at column on task_state: updated_at alone can't distinguish a 7 day key from a 30 day key. NULL means fall back to the global default_retention_days; set means delete after this timestamp regardless of updated_at. Setting default_retention_days = 0 disables time-based cleanup entirely (expires_at cleanup still runs).
- BaseStateBackend.cleanup() no-op default — custom backends override this to implement their own retention policy. The backend reads [state_store] default_retention_days from config itself since the AIP says "the backend is responsible for enforcing the retention policy."
- New [state_store] config options: default_retention_days = 30 (task_state only — does not affect asset_state) and clear_on_success = False.
- MetastoreStateBackend.cleanup() runs two passes for task_state: rows past the updated_at + default_retention_days cutoff, and rows with expires_at < now().
- airflow state-store cleanup CLI command — calls get_state_backend().cleanup(). Operators schedule this via cron or a maintenance DAG. Supports --dry-run.
- Orphaned asset_state cleanup lives in _update_asset_orphanage() — runs in the same pass as asset deregistration, which is when the orphans are created. This is the right home since it is an internal consistency operation, not a user-facing data lifecycle decision.

Why a CLI command instead of the scheduler?
Running cleanup as a scheduler periodic task was considered, but it raises performance concerns for the scheduler since cleanup doesn't come without a time cost.
A dedicated CLI keeps the separation clean; schedule it wherever it makes sense for a deployment.
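For instance, operators could wrap the command in a small maintenance DAG; the dag_id, schedule, and import paths below are illustrative assumptions for an Airflow 3 deployment, not part of this PR:

import pendulum

from airflow.providers.standard.operators.bash import BashOperator
from airflow.sdk import DAG

with DAG(
    dag_id="state_store_cleanup",
    schedule="@daily",
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    catchup=False,
):
    # Shells out to the new CLI; a cron entry calling the same command works equally well.
    BashOperator(task_id="cleanup", bash_command="airflow state-store cleanup")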
User implications / backcompat
New config options under [state_store] with safe defaults — no action needed to maintain existing behaviour. The expires_at column is nullable; existing rows get NULL (global default retention applies).

Testing
Test setup
export AIRFLOW__STATE_STORE__DEFAULT_RETENTION_DAYS=1

Testing for updated_at / global cleanup

airflow state-store cleanup

Testing for expires_at

Run the airflow state-store cleanup command (note here that the row with expires_at is cleared even though updated_at has still not reached its cleanup interval)

Dry run:
[Breeze:3.10.20] root@8872db171dd2:/opt/airflow$ airflow state-store cleanup --dry-run
2026-05-07T12:13:34.018675Z [info ] setup plugin alembic.autogenerate.schemas [alembic.runtime.plugins] loc=plugins.py:37
2026-05-07T12:13:34.018817Z [info ] setup plugin alembic.autogenerate.tables [alembic.runtime.plugins] loc=plugins.py:37
2026-05-07T12:13:34.018870Z [info ] setup plugin alembic.autogenerate.types [alembic.runtime.plugins] loc=plugins.py:37
2026-05-07T12:13:34.018914Z [info ] setup plugin alembic.autogenerate.constraints [alembic.runtime.plugins] loc=plugins.py:37
2026-05-07T12:13:34.018995Z [info ] setup plugin alembic.autogenerate.defaults [alembic.runtime.plugins] loc=plugins.py:37
2026-05-07T12:13:34.019080Z [info ] setup plugin alembic.autogenerate.comments [alembic.runtime.plugins] loc=plugins.py:37
Would delete 2 task state row(s):
  Older than retention period (1):
    DAG 'my_dag', run 'manual__2026-05-07T11:46:58.479591+00:00', task 't1', key 'job_id_2'
  Per-key expiry reached (1):
    DAG 'my_dag', run 'manual__2026-05-07T11:46:58.479591+00:00', task 't1', key 'job_id

What's next
- clear_on_success hook: Clear task state on TI success #66460
- task_state.set(retention_days=N) API to populate expires_at at write time: Add ability for Per task state key retention at operator level #66461

Was generative AI tooling used to co-author this PR?