-
Notifications
You must be signed in to change notification settings - Fork 17.3k
Use bounded DBDagBag cache in scheduler #69007
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -347,8 +347,25 @@ def __init__( | |||||||||
|
|
||||||||||
| if log: | ||||||||||
| self._log = log | ||||||||||
| dag_cache_size = conf.getint("scheduler", "dag_cache_size", fallback=1024) | ||||||||||
| dag_cache_ttl_config = conf.getint("scheduler", "dag_cache_ttl", fallback=10800) | ||||||||||
|
Comment on lines
+350
to
+351
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Therer is already a default in the conf. I wouldn't put a fallback too.
Suggested change
|
||||||||||
|
|
||||||||||
| self.scheduler_dag_bag = DBDagBag(load_op_links=False) | ||||||||||
| if dag_cache_size < 0: | ||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Minor inconsistency: API server uses cache_size <= 0 → unbounded; scheduler uses < 0 → warn+0, == 0 → unbounded. Both end at "0 = unbounded," so behavior matches, just expressed differently. |
||||||||||
| self.log.warning("scheduler dag_cache_size must be >= 0, using unbounded dict") | ||||||||||
| dag_cache_size = 0 | ||||||||||
|
|
||||||||||
| if dag_cache_ttl_config < 0: | ||||||||||
| self.log.warning("scheduler dag_cache_ttl must be >= 0, disabling TTL") | ||||||||||
| dag_cache_ttl_config = 0 | ||||||||||
|
|
||||||||||
| dag_cache_ttl = dag_cache_ttl_config if dag_cache_ttl_config > 0 else None | ||||||||||
|
|
||||||||||
| self.scheduler_dag_bag = DBDagBag( | ||||||||||
| load_op_links=False, | ||||||||||
| cache_size=dag_cache_size, | ||||||||||
| cache_ttl=dag_cache_ttl, | ||||||||||
|
Mady356 marked this conversation as resolved.
|
||||||||||
| stats_prefix="scheduler.dag_bag", | ||||||||||
|
Comment on lines
+363
to
+367
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This enables the bounded LRU/TTL cache for the scheduler, but the scheduler's access pattern is the one case where a count-based cap backfires, so I think the approach needs a rethink before merge. #60804 deliberately left the scheduler on the unbounded dict (its description: "The scheduler continues using a plain unbounded dict with zero lock overhead") and enabled the bounded cache for the API server only, because the access profiles differ:
A cyclic sweep over N distinct The leak here is superseded Minor, worth noting in the description: enabling the cache also flips the scheduler from
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @kaxil. I agree with your analysis for the scheduler use case. Before opening #69001, I went back and read through the design decisions in #60804 because I expected we should first discuss the right approach before implementing (or not) anything. One thing I wasn't completely sure about from that discussion is whether the intended solution for the scheduler was simply to rely on If we do want to introduce cache eviction for the scheduler, defaulting Regarding the
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That makes a lot of sense and I understand why a low count-based LRU default is risky for the scheduler: if the active dag_version_id working set is larger than the cache size, the scheduler can end up thrashing and repeatedly paying the DB read/deserialization cost. My current thinking is to revise this so scheduler eviction is TTL-driven by default instead of size-cap driven. Concretely, that would mean defaulting scheduler dag_cache_size to 0, keeping dag_cache_ttl set, and updating DBDagBag so a non-zero TTL can evict entries even when there is no count-based maxsize. That should target the superseded dag_version_id growth without evicting the active cyclic working set. I’d also remove the redundant conf fallbacks and add tests for the non-positive cache size / TTL path. Does that direction sound reasonable before I rework the PR? |
||||||||||
| ) | ||||||||||
|
|
||||||||||
| # Set of (dag_id, asset_name, asset_uri) tuples for trigger policies that | ||||||||||
| # are permanently unreachable for the rollup window's cardinality — the | ||||||||||
|
|
||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -63,8 +63,7 @@ class DBDagBag: | |
| Internal class for retrieving dags from the database. | ||
|
|
||
| Optionally supports LRU+TTL caching when cache_size is provided. | ||
| The scheduler uses this without caching, while the API server can | ||
| enable caching via configuration. | ||
| Callers can enable bounded caching by passing cache_size and cache_ttl. | ||
|
|
||
| :meta private: | ||
| """ | ||
|
|
@@ -74,15 +73,18 @@ def __init__( | |
| load_op_links: bool = True, | ||
| cache_size: int | None = None, | ||
| cache_ttl: int | None = None, | ||
| stats_prefix: str = "api_server.dag_bag", | ||
|
Mady356 marked this conversation as resolved.
|
||
| ) -> None: | ||
| """ | ||
| Initialize DBDagBag. | ||
|
|
||
| :param load_op_links: Should the extra operator link be loaded when de-serializing the DAG? | ||
| :param cache_size: Size of LRU cache. If None or 0, uses unbounded dict (no eviction). | ||
| :param cache_ttl: Time-to-live for cache entries in seconds. If None or 0, no TTL (LRU only). | ||
| :param stats_prefix: Prefix for cache-related metrics emitted by this DBDagBag. | ||
| """ | ||
| self.load_op_links = load_op_links | ||
| self._stats_prefix = stats_prefix | ||
| self._dags: MutableMapping[UUID | str, _CacheEntry] = {} | ||
| self._use_cache = False | ||
|
|
||
|
|
@@ -111,7 +113,7 @@ def _read_dag(self, serdag: SerializedDagModel) -> SerializedDAG | None: | |
| self._dags[serdag.dag_version_id] = _CacheEntry(dag, serdag.dag_hash, time.monotonic()) | ||
| cache_size = len(self._dags) | ||
| if self._use_cache: | ||
| stats.gauge("api_server.dag_bag.cache_size", cache_size, rate=0.1) | ||
| stats.gauge(f"{self._stats_prefix}.cache_size", cache_size, rate=0.1) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Changing the calls to |
||
| return dag | ||
|
|
||
| @staticmethod | ||
|
|
@@ -134,7 +136,7 @@ def _get_dag(self, version_id: UUID | str, session: Session) -> SerializedDAG | | |
| # cannot have gone stale yet -- serve it without touching the DB. | ||
| if now - cached.last_validated < self._revalidation_interval: | ||
| if self._use_cache: | ||
| stats.incr("api_server.dag_bag.cache_hit") | ||
| stats.incr(f"{self._stats_prefix}.cache_hit") | ||
| return cached.dag | ||
| # Past the window: a version may have been updated in place (same dag_version_id, new | ||
| # content + new dag_hash) by SerializedDagModel.write_dag, so confirm the cached copy | ||
|
|
@@ -149,7 +151,7 @@ def _get_dag(self, version_id: UUID | str, session: Session) -> SerializedDAG | | |
| if current is not None and current.dag_hash == cached.dag_hash: | ||
| self._dags[version_id] = current._replace(last_validated=now) | ||
| if self._use_cache: | ||
| stats.incr("api_server.dag_bag.cache_hit") | ||
| stats.incr(f"{self._stats_prefix}.cache_hit") | ||
| return cached.dag | ||
| # Stale (updated in place) or the version no longer exists: drop and reload below. | ||
| with self._lock: | ||
|
|
@@ -169,9 +171,9 @@ def _get_dag(self, version_id: UUID | str, session: Session) -> SerializedDAG | | |
| if self._use_cache: | ||
| with self._lock: | ||
| if (cached := self._dags.get(version_id)) is not None: | ||
| stats.incr("api_server.dag_bag.cache_hit") | ||
| stats.incr(f"{self._stats_prefix}.cache_hit") | ||
| return cached.dag | ||
| stats.incr("api_server.dag_bag.cache_miss") | ||
| stats.incr(f"{self._stats_prefix}.cache_miss") | ||
| return self._read_dag(serdag) | ||
|
|
||
| def get_dag(self, version_id: UUID | str, session: Session) -> SerializedDAG | None: | ||
|
|
@@ -203,8 +205,8 @@ def clear_cache(self) -> int: | |
| self._dags.clear() | ||
|
|
||
| if self._use_cache: | ||
| stats.incr("api_server.dag_bag.cache_clear") | ||
| stats.gauge("api_server.dag_bag.cache_size", 0) | ||
| stats.incr(f"{self._stats_prefix}.cache_clear") | ||
| stats.gauge(f"{self._stats_prefix}.cache_size", 0) | ||
| return count | ||
|
|
||
| @staticmethod | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -408,6 +408,24 @@ def test_heartrate(self, heartrate): | |
| _ = SchedulerJobRunner(job=scheduler_job, executors=[self.null_exec]) | ||
| assert scheduler_job.heartrate == heartrate | ||
|
|
||
| @patch("airflow.jobs.scheduler_job_runner.DBDagBag") | ||
| def test_scheduler_dag_bag_uses_scheduler_cache_config(self, mock_db_dag_bag): | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Missing test for |
||
| with conf_vars( | ||
| { | ||
| ("scheduler", "dag_cache_size"): "123", | ||
| ("scheduler", "dag_cache_ttl"): "456", | ||
| } | ||
| ): | ||
| scheduler_job = Job() | ||
| SchedulerJobRunner(job=scheduler_job, executors=[self.null_exec]) | ||
|
|
||
| mock_db_dag_bag.assert_called_once_with( | ||
| load_op_links=False, | ||
| cache_size=123, | ||
| cache_ttl=456, | ||
| stats_prefix="scheduler.dag_bag", | ||
| ) | ||
|
|
||
| def test_no_orphan_process_will_be_left(self): | ||
| current_process = psutil.Process() | ||
| old_children = current_process.children(recursive=True) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This default might be low. That's the number of versions accross dags. Having more than 1024 dags can be common. Trading the memory issue for repeated db read.
It's configurable anyway, but maybe a bigger default can be better suited.