Use bounded DBDagBag cache in scheduler#69007

Open
Mady356 wants to merge 3 commits into
apache:mainfrom
Mady356:fix-dbdagbag-cache-eviction

Conversation

@Mady356

@Mady356 Mady356 commented Jun 26, 2026


Use bounded DBDagBag caching in the scheduler.

The scheduler was creating DBDagBag(load_op_links=False) without cache settings, which meant it used the default unbounded dictionary for cached deserialized DAGs. This PR makes the scheduler use the existing DBDagBag LRU/TTL cache support instead.

Changes:

  • Added [scheduler] dag_cache_size and [scheduler] dag_cache_ttl config options.
  • Updated SchedulerJobRunner to pass those config values into DBDagBag.
  • Added a stats_prefix parameter to DBDagBag so scheduler cache metrics are emitted under scheduler.dag_bag instead of the API server prefix.
  • Added scheduler DagBag cache metrics to the metrics template.
  • Handled negative config values safely by falling back to an unbounded cache / disabled TTL.
  • Updated the DBDagBag class docstring to reflect that callers can enable bounded caching.
  • Added a unit test confirming the scheduler passes the cache config into DBDagBag.

closes: #69001

Testing:

python -m py_compile airflow-core/src/airflow/models/dagbag.py
python -m py_compile airflow-core/src/airflow/jobs/scheduler_job_runner.py
pytest airflow-core/tests/unit/jobs/test_scheduler_job.py::TestSchedulerJob::test_scheduler_dag_bag_uses_scheduler_cache_config
pytest airflow-core/tests/unit/models/test_dagbag.py


---

##### Was generative AI tooling used to co-author this PR?

<!--
If generative AI tooling has been used in the process of authoring this PR, please
change below checkbox to `[X]` followed by the name of the tool, uncomment the "Generated-by".
-->

- [ ] Yes (please specify the tool below)

<!--
Generated-by: [Tool Name] following [the guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
-->

---

* Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)** for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
* For fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
* When adding dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
* For significant user-facing changes create newsfragment: `{pr_number}.significant.rst`, in [airflow-core/newsfragments](https://github.com/apache/airflow/tree/main/airflow-core/newsfragments). You can add this file in a follow-up commit after the PR is created so you know the PR number.

@Mady356 Mady356 requested review from XD-DENG and ashb as code owners June 26, 2026 01:43
@boring-cyborg boring-cyborg Bot added area:ConfigTemplates area:Scheduler including HA (high availability) scheduler labels Jun 26, 2026
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py
Comment thread airflow-core/src/airflow/config_templates/config.yml Outdated
Comment thread airflow-core/src/airflow/models/dagbag.py

@pierrejeambrun pierrejeambrun left a comment

Member

LGTM, a few nits / suggestions, but nothing blocking

version_added: 3.3.0
type: integer
example: ~
default: "1024"

Member

This default might be low. The cache size counts DAG versions across all DAGs, and having more than 1024 DAGs is common. With too small a cap we trade the memory issue for repeated DB reads.

It's configurable anyway, but a bigger default may be better suited.

Comment on lines +350 to +351
dag_cache_size = conf.getint("scheduler", "dag_cache_size", fallback=1024)
dag_cache_ttl_config = conf.getint("scheduler", "dag_cache_ttl", fallback=10800)

Member

There is already a default in the conf, so I wouldn't add a fallback as well.

Suggested change
-dag_cache_size = conf.getint("scheduler", "dag_cache_size", fallback=1024)
-dag_cache_ttl_config = conf.getint("scheduler", "dag_cache_ttl", fallback=10800)
+dag_cache_size = conf.getint("scheduler", "dag_cache_size")
+dag_cache_ttl_config = conf.getint("scheduler", "dag_cache_ttl")

dag_cache_ttl_config = conf.getint("scheduler", "dag_cache_ttl", fallback=10800)

self.scheduler_dag_bag = DBDagBag(load_op_links=False)
if dag_cache_size < 0:

Member

Minor inconsistency: API server uses cache_size <= 0 → unbounded; scheduler uses < 0 → warn+0, == 0 → unbounded. Both end at "0 = unbounded," so behavior matches, just expressed differently.
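
To make the equivalence concrete, here is a minimal sketch of the two conventions side by side. The function names `normalize_cache_size` and `is_unbounded` are hypothetical and exist only for this illustration; they are not Airflow APIs, and the warning text is made up.

```python
import logging

log = logging.getLogger(__name__)


def normalize_cache_size(raw: int) -> int:
    """Scheduler-style handling (hypothetical helper): negative -> warn and use 0."""
    if raw < 0:
        log.warning("Negative dag_cache_size %s; falling back to unbounded cache", raw)
        return 0
    return raw


def is_unbounded(cache_size: int) -> bool:
    """API-server-style check (hypothetical helper): anything <= 0 is unbounded."""
    return cache_size <= 0


# Both conventions agree on which configured values end up unbounded.
for configured in (-5, 0, 1024):
    normalized = normalize_cache_size(configured)
    print(configured, normalized, is_unbounded(normalized))
```

The only observable difference in this sketch is the warning logged for negative values; the resulting cache mode is the same either way.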

assert scheduler_job.heartrate == heartrate

@patch("airflow.jobs.scheduler_job_runner.DBDagBag")
def test_scheduler_dag_bag_uses_scheduler_cache_config(self, mock_db_dag_bag):

Member

Missing a test for a cache size < 0, which should raise a warning and fall back to unbounded.

Comment on lines +363 to +367
self.scheduler_dag_bag = DBDagBag(
    load_op_links=False,
    cache_size=dag_cache_size,
    cache_ttl=dag_cache_ttl,
    stats_prefix="scheduler.dag_bag",

Member

This enables the bounded LRU/TTL cache for the scheduler, but the scheduler's access pattern is the one case where a count-based cap backfires, so I think the approach needs a rethink before merge.

#60804 deliberately left the scheduler on the unbounded dict (its description: "The scheduler continues using a plain unbounded dict with zero lock overhead") and enabled the bounded cache for the API server only, because the access profiles differ:

  • get_dag_for_run() is called for essentially every running DagRun the scheduler processes.
  • DagRun.get_running_dag_runs_to_examine() orders by last_scheduling_decision (least-recently-scheduled first), so across consecutive loops the scheduler round-robins through all running runs. The per-loop lru_cache() wrappers only dedupe within a single loop; the persistent cross-loop cache is this DBDagBag.

A cyclic sweep over N distinct dag_version_ids against an LRU/TTL cache of maxsize M < N is the sequential-flooding case: each key is evicted just before its next access, so the hit rate collapses toward zero once N > M. Every miss then pays session.get(DagVersion, ..., joinedload(serialized_dag)) plus a full SerializedDAG deserialization on the scheduler hot path. The deployments that hit this OOM are the large ones where active versions exceed 1024, so as written the default puts them straight into that regime. (Same concern Pierre raised on the default, but it's really about the eviction mechanism, not just the number.)
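
The sequential-flooding effect can be reproduced with a toy model. This is a sketch under simplifying assumptions, not the DBDagBag implementation: `LRUCache` and `hit_rate` are hypothetical names, the keys stand in for dag_version_ids, and the string value stands in for the DB read plus deserialization.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: evicts the least recently used key once over capacity."""

    def __init__(self, maxsize: int) -> None:
        self.maxsize = maxsize
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: int) -> str:
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = f"dag-{key}"  # stands in for DB read + deserialization
        self._data[key] = value
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # drop the least recently used key
        return value


def hit_rate(n_keys: int, maxsize: int, sweeps: int = 10) -> float:
    """Sweep the same n_keys in the same order, as a round-robin loop would."""
    cache = LRUCache(maxsize)
    for _ in range(sweeps):
        for key in range(n_keys):
            cache.get_or_load(key)
    return cache.hits / (cache.hits + cache.misses)


print(hit_rate(n_keys=4, maxsize=4))  # working set fits
print(hit_rate(n_keys=5, maxsize=4))  # working set exceeds the cap by one
```

In this model, a working set that fits the cache misses only on the first sweep, while exceeding the cap by a single key makes every access a miss, because each key is evicted just before the sweep returns to it.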

The leak here is superseded dag_version_ids accumulating: once a version stops being referenced by running runs, it's never looked up again. TTL eviction targets exactly those, while the refresh-on-revalidation write-back in _get_dag keeps the hot active set resident regardless of its size. So TTL-driven eviction (default dag_cache_size=0, or a safety-valve cap well above realistic active-version counts, with the TTL doing the real bounding) fixes the reported growth without the thrash. Note too that a count cap bounds cardinality, not bytes -- 1024 large serialized DAGs can still be hundreds of MB, whereas memray measured bytes retained.
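
A rough sketch of why TTL-only eviction fits this pattern, using a manual clock so the run is deterministic. `TTLOnlyCache` and `simulate` are hypothetical names, and the assumption that every lookup rewrites the entry (and so resets its expiry) is a simplified stand-in for the refresh-on-revalidation write-back, not a description of the actual `_get_dag` code.

```python
class TTLOnlyCache:
    """No count cap; an entry expires `ttl` seconds after its last refresh."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._expires_at = {}  # key -> expiry timestamp

    def touch(self, key: str, now: float) -> None:
        # Assumption: each lookup rewrites the entry, which resets its TTL.
        self._expires_at[key] = now + self.ttl

    def expire(self, now: float) -> None:
        # Drop every entry whose expiry time has been reached.
        self._expires_at = {k: t for k, t in self._expires_at.items() if t > now}

    def keys(self) -> set:
        return set(self._expires_at)


def simulate() -> set:
    cache = TTLOnlyCache(ttl=10800)  # 3 hours, the TTL value quoted in this review
    active = [f"v{i}" for i in range(2000)]  # active set larger than 1024
    cache.touch("superseded", now=0)  # looked up once, never again
    for loop in range(5):
        now = loop * 3600  # one scheduler sweep per simulated hour
        cache.expire(now)
        for key in active:
            cache.touch(key, now)
    return cache.keys()


remaining = simulate()
print(len(remaining), "superseded" in remaining)
```

In this model the 2000 active keys stay resident because each sweep refreshes them, while the superseded key ages out once its TTL elapses, regardless of how large the active set is.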

Minor, worth noting in the description: enabling the cache also flips the scheduler from nullcontext to a real RLock per _get_dag, which #60804 explicitly chose to avoid. Cheap when uncontended, so not blocking, but it reverses a documented decision.

@aeroyorch aeroyorch Jun 27, 2026

Contributor

Hi @kaxil. I agree with your analysis for the scheduler use case. Before opening #69001, I went back and read through the design decisions in #60804 because I expected we should first discuss the right approach before implementing (or not) anything.

One thing I wasn't completely sure about from that discussion is whether the intended solution for the scheduler was simply to rely on num_runs and periodically restart the scheduler. If that's the recommended mitigation, I think it would be worth documenting somewhere, since it's not immediately obvious.

If we do want to introduce cache eviction for the scheduler, defaulting dag_cache_size to 0 (unbounded) and relying only on TTL eviction seems like a conservative choice. That addresses the leak of superseded dag_version_id while avoiding the LRU thrashing concerns you described for large deployments.

Regarding the RLock overhead mentioned in #60804, I'm not sure whether the performance impact is significant enough in practice to justify keeping an unbounded dict in the scheduler forever.

Author

That makes a lot of sense and I understand why a low count-based LRU default is risky for the scheduler: if the active dag_version_id working set is larger than the cache size, the scheduler can end up thrashing and repeatedly paying the DB read/deserialization cost.

My current thinking is to revise this so scheduler eviction is TTL-driven by default instead of size-cap driven. Concretely, that would mean defaulting scheduler dag_cache_size to 0, keeping dag_cache_ttl set, and updating DBDagBag so a non-zero TTL can evict entries even when there is no count-based maxsize. That should target the superseded dag_version_id growth without evicting the active cyclic working set.

I’d also remove the redundant conf fallbacks and add tests for the non-positive cache size / TTL path.

Does that direction sound reasonable before I rework the PR?

 cache_size = len(self._dags)
 if self._use_cache:
-    stats.gauge("api_server.dag_bag.cache_size", cache_size, rate=0.1)
+    stats.gauge(f"{self._stats_prefix}.cache_size", cache_size, rate=0.1)

Contributor

Changing the calls to f"{self._stats_prefix}.cache_*" makes check_metrics_synced_with_the_registry see {_stats_prefix}.cache_hit, {_stats_prefix}.cache_miss, etc. as missing from the registry. The runtime metric split makes sense, but the metric names need to stay representable to the registry checker, or the checker/registry needs to support this pattern.
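
As a rough illustration of why a static check struggles here, the sketch below scans source text for metric names using Python's `ast` module. It is an assumption about how such a check might work, not the actual check_metrics_synced_with_the_registry implementation; `collect_metric_names` and `SOURCE` are hypothetical.

```python
import ast

# The two call styles from the diff above, as plain source text.
SOURCE = '''
stats.gauge("api_server.dag_bag.cache_size", cache_size, rate=0.1)
stats.gauge(f"{self._stats_prefix}.cache_size", cache_size, rate=0.1)
'''


def collect_metric_names(source: str) -> tuple[list[str], int]:
    """Return (literal metric names, number of dynamically built names)."""
    literal_names = []
    dynamic = 0
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and node.args):
            continue
        first = node.args[0]
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            literal_names.append(first.value)  # fully known without running code
        elif isinstance(first, ast.JoinedStr):
            dynamic += 1  # f-string: the prefix is only known at runtime
    return literal_names, dynamic


names, dynamic = collect_metric_names(SOURCE)
print(names, dynamic)
```

A scanner of this kind can match the literal name against a registry, but for the f-string it only sees a template, which is consistent with the templated names being reported as missing.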

Labels

area:ConfigTemplates area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Scheduler DBDagBag cache is never evicted and grows unbounded

5 participants