Use bounded DBDagBag cache in scheduler by Mady356 · Pull Request #69007 · apache/airflow

Mady356 · 2026-06-26T01:43:57Z

Use bounded DBDagBag caching in the scheduler.

The scheduler was creating DBDagBag(load_op_links=False) without cache settings, which meant it used the default unbounded dictionary for cached deserialized DAGs. This PR makes the scheduler use the existing DBDagBag LRU/TTL cache support instead.

Changes:

* Added [scheduler] dag_cache_size and [scheduler] dag_cache_ttl config options.
* Updated SchedulerJobRunner to pass those config values into DBDagBag.
* Added a stats_prefix parameter to DBDagBag so scheduler cache metrics are emitted under scheduler.dag_bag instead of the API server prefix.
* Added scheduler DagBag cache metrics to the metrics template.
* Negative config values are handled safely by falling back to unbounded cache / disabled TTL behavior.
* Updated the DBDagBag class docstring to reflect that callers can enable bounded caching.
* Added a unit test confirming the scheduler passes the cache config into DBDagBag.

closes: #69001

Testing:

python -m py_compile airflow-core/src/airflow/models/dagbag.py
python -m py_compile airflow-core/src/airflow/jobs/scheduler_job_runner.py
pytest airflow-core/tests/unit/jobs/test_scheduler_job.py::TestSchedulerJob::test_scheduler_dag_bag_uses_scheduler_cache_config
pytest airflow-core/tests/unit/models/test_dagbag.py

---

##### Was generative AI tooling used to co-author this PR?

- [ ] Yes (please specify the tool below)

pierrejeambrun

LGTM, a few nits / suggestions, but nothing blocking

pierrejeambrun · 2026-06-26T12:20:01Z

+      version_added: 3.3.0
+      type: integer
+      example: ~
+      default: "1024"


This default might be low. The limit counts cached DAG versions across all DAGs, and deployments with more than 1024 DAGs are common, so this trades the memory issue for repeated DB reads.

It's configurable anyway, but maybe a bigger default can be better suited.

pierrejeambrun · 2026-06-26T12:22:37Z

+        dag_cache_size = conf.getint("scheduler", "dag_cache_size", fallback=1024)
+        dag_cache_ttl_config = conf.getint("scheduler", "dag_cache_ttl", fallback=10800)


There is already a default in the config, so I wouldn't add a fallback as well.

Suggested change

-        dag_cache_size = conf.getint("scheduler", "dag_cache_size", fallback=1024)
-        dag_cache_ttl_config = conf.getint("scheduler", "dag_cache_ttl", fallback=10800)
+        dag_cache_size = conf.getint("scheduler", "dag_cache_size")
+        dag_cache_ttl_config = conf.getint("scheduler", "dag_cache_ttl")

pierrejeambrun · 2026-06-26T12:23:43Z

+        dag_cache_ttl_config = conf.getint("scheduler", "dag_cache_ttl", fallback=10800)

-        self.scheduler_dag_bag = DBDagBag(load_op_links=False)
+        if dag_cache_size < 0:


Minor inconsistency: API server uses cache_size <= 0 → unbounded; scheduler uses < 0 → warn+0, == 0 → unbounded. Both end at "0 = unbounded," so behavior matches, just expressed differently.
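A minimal sketch of the two conventions described in this comment. The helper names `api_server_style` and `scheduler_style` are hypothetical and are not the actual Airflow code; the point is only that both map every input to the same effective size, with 0 meaning unbounded.

```python
import logging

log = logging.getLogger(__name__)


def api_server_style(cache_size: int) -> int:
    """Treat any non-positive size as unbounded (represented here as 0)."""
    return 0 if cache_size <= 0 else cache_size


def scheduler_style(cache_size: int) -> int:
    """Warn on a negative size and coerce it to 0; 0 already means unbounded."""
    if cache_size < 0:
        log.warning("Negative dag_cache_size %s, falling back to unbounded cache", cache_size)
        return 0
    return cache_size


# Both conventions agree on the effective size for every input.
for value in (-5, 0, 1, 1024):
    assert api_server_style(value) == scheduler_style(value)
```

The only observable difference in this sketch is the warning emitted for negative values.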

pierrejeambrun · 2026-06-26T12:24:11Z

            assert scheduler_job.heartrate == heartrate

+    @patch("airflow.jobs.scheduler_job_runner.DBDagBag")
+    def test_scheduler_dag_bag_uses_scheduler_cache_config(self, mock_db_dag_bag):


Missing a test for the cache size < 0 case, which should log a warning and fall back to unbounded.

kaxil · 2026-06-27T00:06:13Z

+        self.scheduler_dag_bag = DBDagBag(
+            load_op_links=False,
+            cache_size=dag_cache_size,
+            cache_ttl=dag_cache_ttl,
+            stats_prefix="scheduler.dag_bag",


This enables the bounded LRU/TTL cache for the scheduler, but the scheduler's access pattern is the one case where a count-based cap backfires, so I think the approach needs a rethink before merge.

#60804 deliberately left the scheduler on the unbounded dict (its description: "The scheduler continues using a plain unbounded dict with zero lock overhead") and enabled the bounded cache for the API server only, because the access profiles differ:

get_dag_for_run() is called for essentially every running DagRun the scheduler processes.

DagRun.get_running_dag_runs_to_examine() orders by last_scheduling_decision (least-recently-scheduled first), so across consecutive loops the scheduler round-robins through all running runs. The per-loop lru_cache() wrappers only dedupe within a single loop; the persistent cross-loop cache is this DBDagBag.

A cyclic sweep over N distinct dag_version_ids against an LRU/TTL cache of maxsize M < N is the sequential-flooding case: each key is evicted just before its next access, so the hit rate collapses toward zero once N > M. Every miss then pays session.get(DagVersion, ..., joinedload(serialized_dag)) plus a full SerializedDAG deserialization on the scheduler hot path. The deployments that hit this OOM are the large ones where active versions exceed 1024, so as written the default puts them straight into that regime. (Same concern Pierre raised on the default, but it's really about the eviction mechanism, not just the number.)
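The sequential-flooding effect can be reproduced with a small simulation. This is a sketch under stated assumptions: `LRUCache` and `hit_rate` are hypothetical names, the cache is a plain count-bounded LRU rather than DBDagBag, and a miss merely stands in for the DB read plus deserialization.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal count-bounded LRU cache, used only for this simulation."""

    def __init__(self, maxsize: int) -> None:
        self.maxsize = maxsize
        self._data: OrderedDict[int, str] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: int) -> str:
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1  # stands in for a DB read plus deserialization
        value = f"dag-{key}"
        self._data[key] = value
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used key
        return value


def hit_rate(num_versions: int, maxsize: int, loops: int = 10) -> float:
    """Sweep cyclically over num_versions keys and return the overall hit rate."""
    cache = LRUCache(maxsize)
    for _ in range(loops):
        for version_id in range(num_versions):
            cache.get_or_load(version_id)
    return cache.hits / (cache.hits + cache.misses)


print(hit_rate(num_versions=1024, maxsize=1024))  # working set fits in the cache
print(hit_rate(num_versions=1025, maxsize=1024))  # one key too many
```

With the working set exactly at the cap, only the first sweep misses. With a single extra key, every access evicts the entry that will be needed next, so the sweep never hits.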

The leak here is superseded dag_version_ids accumulating: once a version stops being referenced by running runs, it's never looked up again. TTL eviction targets exactly those, while the refresh-on-revalidation write-back in _get_dag keeps the hot active set resident regardless of its size. So TTL-driven eviction (default dag_cache_size=0, or a safety-valve cap well above realistic active-version counts, with the TTL doing the real bounding) fixes the reported growth without the thrash. Note too that a count cap bounds cardinality, not bytes -- 1024 large serialized DAGs can still be hundreds of MB, whereas memray measured bytes retained.
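A rough sketch of TTL-driven eviction with write-back on access, assuming a simplified model: `TTLOnlyCache`, `FakeClock`, and `simulate` are hypothetical and do not reflect how DBDagBag or `_get_dag` are actually implemented. It illustrates the claim that a key looked up once expires, while keys swept regularly stay resident without any count cap.

```python
from typing import Callable


class FakeClock:
    """Manually advanced clock so the simulation is deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


class TTLOnlyCache:
    """Hypothetical cache with no count cap; entries expire ttl seconds after last use."""

    def __init__(self, ttl: float, clock: Callable[[], float]) -> None:
        self.ttl = ttl
        self.clock = clock
        self.loads = 0
        self._data: dict[int, tuple[float, str]] = {}

    def get_or_load(self, key: int) -> str:
        now = self.clock()
        entry = self._data.get(key)
        if entry is not None and entry[0] > now:
            value = entry[1]
        else:
            self.loads += 1  # stands in for a DB read plus deserialization
            value = f"dag-{key}"
        # Write back on every access so keys still in use keep a fresh expiry.
        self._data[key] = (now + self.ttl, value)
        return value

    def expire(self) -> None:
        now = self.clock()
        self._data = {k: v for k, v in self._data.items() if v[0] > now}

    def keys(self) -> list[int]:
        return sorted(self._data)


def simulate() -> tuple[list[int], int]:
    clock = FakeClock()
    cache = TTLOnlyCache(ttl=100.0, clock=clock)
    cache.get_or_load(99)  # superseded version, looked up once at t=0
    for _ in range(30):
        for version_id in range(5):  # active versions, swept every 10 seconds
            cache.get_or_load(version_id)
        clock.now += 10.0
    cache.expire()
    return cache.keys(), cache.loads
```

In this model `simulate()` leaves only the five active keys in the cache, and each key is loaded exactly once, so the active set is never reloaded however long the sweep continues.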

Minor, worth noting in the description: enabling the cache also flips the scheduler from nullcontext to a real RLock per _get_dag, which #60804 explicitly chose to avoid. Cheap when uncontended, so not blocking, but it reverses a documented decision.
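The lock switch mentioned here can be pictured with a small stand-in class. `TinyBag` is hypothetical and is not the real DBDagBag; it only assumes, as the comment states, that a real `RLock` is used when the bounded cache is enabled and a no-op `nullcontext` otherwise.

```python
from contextlib import nullcontext
from threading import RLock


class TinyBag:
    """Hypothetical stand-in showing the lock choice, not the real DBDagBag."""

    def __init__(self, use_cache: bool) -> None:
        self._dags: dict[str, str] = {}
        # A real lock only when the bounded cache is enabled, otherwise a no-op context.
        self._lock = RLock() if use_cache else nullcontext()

    def get(self, version_id: str) -> str:
        with self._lock:
            if version_id not in self._dags:
                self._dags[version_id] = f"dag-{version_id}"
            return self._dags[version_id]


unbounded = TinyBag(use_cache=False)  # previous scheduler behaviour: no lock overhead
bounded = TinyBag(use_cache=True)     # behaviour once the cache is enabled
```

Both variants return the same values; the difference is only whether each lookup acquires a lock.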

aeroyorch · Jun 27, 2026 (edited)

Hi @kaxil. I agree with your analysis for the scheduler use case. Before opening #69001, I went back and read through the design decisions in #60804 because I expected we should first discuss the right approach before implementing (or not) anything.

One thing I wasn't completely sure about from that discussion is whether the intended solution for the scheduler was simply to rely on num_runs and periodically restart the scheduler. If that's the recommended mitigation, I think it would be worth documenting somewhere, since it's not immediately obvious.

If we do want to introduce cache eviction for the scheduler, defaulting dag_cache_size to 0 (unbounded) and relying only on TTL eviction seems like a conservative choice. That addresses the leak of superseded dag_version_id while avoiding the LRU thrashing concerns you described for large deployments.

Regarding the RLock overhead mentioned in #60804, I'm not sure whether the performance impact is significant enough in practice to justify keeping an unbounded dict in the scheduler forever.

Mady356 · Jun 27, 2026

That makes a lot of sense and I understand why a low count-based LRU default is risky for the scheduler: if the active dag_version_id working set is larger than the cache size, the scheduler can end up thrashing and repeatedly paying the DB read/deserialization cost.

My current thinking is to revise this so scheduler eviction is TTL-driven by default instead of size-cap driven. Concretely, that would mean defaulting scheduler dag_cache_size to 0, keeping dag_cache_ttl set, and updating DBDagBag so a non-zero TTL can evict entries even when there is no count-based maxsize. That should target the superseded dag_version_id growth without evicting the active cyclic working set.

I’d also remove the redundant conf fallbacks and add tests for the non-positive cache size / TTL path.

Does that direction sound reasonable before I rework the PR?

ephraimbuddy · 2026-07-01T14:14:07Z

            cache_size = len(self._dags)
        if self._use_cache:
-            stats.gauge("api_server.dag_bag.cache_size", cache_size, rate=0.1)
+            stats.gauge(f"{self._stats_prefix}.cache_size", cache_size, rate=0.1)


Changing the calls to f"{self._stats_prefix}.cache_*" makes check_metrics_synced_with_the_registry see {_stats_prefix}.cache_hit, {_stats_prefix}.cache_miss, etc. as missing from the registry. The runtime metric split makes sense, but the metric names need to stay representable to the registry checker, or the checker/registry needs to support this pattern.
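One way to see why a static check would trip on the f-string form is sketched below. This is an illustration only: `literal_gauge_names` is a hypothetical scanner written for this example and is not the actual check_metrics_synced_with_the_registry implementation, which may work differently.

```python
import ast


def literal_gauge_names(source: str) -> set[str]:
    """Return gauge names a static scan can see, i.e. plain string literals only."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "gauge"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            names.add(node.args[0].value)
    return names


before = 'stats.gauge("api_server.dag_bag.cache_size", cache_size, rate=0.1)'
after = 'stats.gauge(f"{self._stats_prefix}.cache_size", cache_size, rate=0.1)'

print(literal_gauge_names(before))  # the literal name is visible
print(literal_gauge_names(after))   # the f-string is an ast.JoinedStr, so nothing is found
```

Under this assumption, either the call sites keep literal names per prefix or the scanner has to learn how to expand the prefix pattern.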

Use bounded DBDagBag cache in scheduler (1adf575)

Mady356 requested review from XD-DENG and ashb as code owners June 26, 2026 01:43

boring-cyborg Bot added area:ConfigTemplates area:Scheduler including HA (high availability) scheduler labels Jun 26, 2026

aeroyorch suggested changes Jun 26, 2026

Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py

aeroyorch reviewed Jun 26, 2026

Comment thread airflow-core/src/airflow/config_templates/config.yml Outdated

Add scheduler DAG bag cache metrics (4029da4)

Mady356 requested review from amoghrajesh and potiuk as code owners June 26, 2026 08:24

aeroyorch suggested changes Jun 26, 2026

Comment thread airflow-core/src/airflow/models/dagbag.py

Set API server DAG bag stats prefix explicitly (5b2d41b)

Mady356 requested review from bugraoz93, choo121600, ephraimbuddy, jason810496, pierrejeambrun, rawwar and shubhamraj-git as code owners June 26, 2026 10:07

pierrejeambrun added this to the Airflow 3.3.1 milestone Jun 26, 2026

aeroyorch approved these changes Jun 26, 2026

pierrejeambrun approved these changes Jun 26, 2026

kaxil reviewed Jun 27, 2026

ephraimbuddy reviewed Jul 1, 2026

Mady356 wants to merge 3 commits into apache:main from Mady356:fix-dbdagbag-cache-eviction

5 participants
