Load example DAGs from providers via ProvidersManager (continuation of #57320)#66161
Load example DAGs from providers via ProvidersManager (continuation of #57320)#66161andreahlert wants to merge 18 commits into
Conversation
1. loading example dags as bundles 2. fixing testcases to use bundles instead of dagbag folders 3. fixing testcases to use example_dags from standard module Signed-off-by: André Ahlert <andre@aex.partners>
Signed-off-by: André Ahlert <andre@aex.partners>
Replace the directory walk over airflow.providers.__path__ with a lookup based on ProvidersManager. The previous approach silently skipped: - nested providers like apache-airflow-providers-common-sql, whose module path is airflow.providers.common.sql (one level deeper); - providers installed outside the airflow.providers namespace package, since they are not visible via os.listdir. The new implementation iterates over the providers registered through the apache_airflow_provider entry point, imports each provider module and adds its example_dags folder when present. Bundle names are now keyed on the canonical package name to keep them unique and stable across deployments. Signed-off-by: André Ahlert <andre@aex.partners>
The set of provider example DAG bundles depends on which providers expose an example_dags folder, which is environment specific. Pin only the built-in bundles and assert the prefix of any extra entry. Signed-off-by: André Ahlert <andre@aex.partners>
Apply remaining suggestions from the original review of the closed PR apache#57320: - bundles/manager.py: document that sync_bundles_to_db only reconciles bundle metadata and does not parse or write DAG rows, clarifying the signature confusion raised by @bhavaniravi during review. - tests/unit/dag_processing/test_dagbag.py: stop importing the standard provider's example_dags module at collection time. Move the import into a small helper plus a pytest fixture so tests that need the folder request it explicitly and the module remains collectable when the standard provider is not yet importable. The single parametrize case that referenced the folder now passes a relative file name and resolves the absolute path inside the test. - tests/unit/cli/commands/test_task_command.py: build a minimal DAG inline in test_task_states_for_dag_run instead of importing one from the standard provider's example_dags. The test only checks CLI behaviour around a known dag_id/task_id, so reproducing the name and a single task is enough to keep the core test decoupled from the standard provider's example DAGs. Signed-off-by: André Ahlert <andre@aex.partners>
The teardown deletes a single DagBundleModel row keyed by name. Use session.execute(delete(...).where(...)) instead of the deprecated session.query(...).filter_by(...).delete() form so prek's prevent-deprecated-sqlalchemy-usage hook stays clean. Signed-off-by: André Ahlert <andre@aex.partners>
|
@jscheffl if you could take a look. |
|
Cool! Thanks for picking-up the PR started by @bhavaniravi - hope we can get this to main in the next release then! Sorry also lost focus on this. Will attempt to take a look on the (extended) weekend and will respond ASAP! |
Thanks! I am Picking up a few related issues/PRs from this epic in parallel. |
After moving the standard provider's example DAGs into their own
bundle, example_python_operator no longer lives in the dags-folder
bundle. Update the 24 expected payloads in test_task_instances.py
that asserted bundle_name='dags-folder' to the new
airflow-provider-apache-airflow-providers-standard-example-dags
bundle name. The lone sync_bag_to_db('dags-folder', ...) call in this
file is unrelated; it registers a synthetic dag built by dag_maker.
Signed-off-by: André Ahlert <andre@aex.partners>
On Airflow 3.1+, parse_and_sync_to_db iterates every registered DAG bundle (including example_dags and airflow-provider-*-example-dags) and syncs them with their own bundle name. The leading DagBag(dag_folder=folder, include_examples=True) was also pulling example DAGs into the dags-folder bundle, so each example DAG ended up registered under two different bundles. The duplicated rows then violated the (asset_id, dag_id) unique constraint on dag_schedule_asset_reference and broke unrelated tests (notably the fab provider tests under compat). Force include_examples=False on the 3.1+ path; the bundle loop is already responsible for loading example DAGs from their own bundles. The 2.x and 3.0 paths are untouched. Signed-off-by: André Ahlert <andre@aex.partners>
Signed-off-by: André Ahlert <andre@aex.partners>
jscheffl
left a comment
There was a problem hiding this comment.
Thanks for driving the stale PR forward! Looks good to me! Just a few nit points.
Request another maintainer for another view but thing this is very good! A step forward in provider modularity!
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
jscheffl
left a comment
There was a problem hiding this comment.
Good for me, leaving this PR open for a moment for another pair of eyes. Also another opinion on "making provider discovery proper in this PR" would be good. For me it is OK like it is.
potiuk
left a comment
There was a problem hiding this comment.
Thanks for picking this up — the underlying change is right, and unblocking the Examples Refurbish epic is worth getting in. Substantive comments inline; high-level summary:
- the
include_examplesparameter onDagBagand onparse_and_sync_to_dbis now a silent no-op (footgun); - the package-name → module derivation in
_add_provider_example_dags_to_bundleis the "proper discovery" concern @jscheffl flagged — there's a robust alternative via the existing entry-point module recorded indiscover_all_providers_from_packages; - the bundle name
airflow-provider-apache-airflow-providers-standard-example-dagsis persisted in DB rows and emitted in REST responses — please pin the format down now rather than after release.
Two things worth keeping as-is:
- the
sync_bundles_to_dbdocstring you added is exactly the contract that was missing — thank you (and it carries the answer @dstandish never gave); - the
parse_and_sync_to_dbcomment about thedag_schedule_asset_referencecollision is excellent — non-obvious failure mode that would otherwise have been rediscovered the hard way.
One repo-level item not anchorable to a line: this is an airflow-core change with a user-visible REST-API bundle-name change, so per AGENTS.md it should ship with a newsfragment in airflow-core/newsfragments/66161.significant.rst.
Holding my approval until at minimum the silent include_examples no-op is addressed; happy to follow up on the rest in a tracking issue if that's preferred.
Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting
Thanks for reviewing. I really need to start using Claude to have a deep review. I think that would help my deliverables. |
- Hoist `importlib`, `logging`, and `ProvidersManager` to module top in `dag_processing/bundles/manager.py` and add a module-level logger. - Reverse the `apache-airflow-providers-` prefix check so the canonical case reads first, per @jscheffl's nit. - Broaden the exception handler around `import_module` and the subsequent `__path__` access to `Exception` with `log.exception`, so a provider with a custom `__getattr__` can no longer crash config loading on the scheduler/api-server. - Switch the provider-example bundle dedup key from `bundle_name` to the resolved `example_dag_folder` and document the namespace-package scenario the guard exists for (multiple `airflow.providers.common.*` distributions sharing one namespace via `pkgutil.extend_path`). - Rename the per-provider bundle from `airflow-provider-{package}-example-dags` to `{package}-example-dags`, matching the package distribution name surfaced to users in REST API responses and `pip list`. - Emit a `DeprecationWarning` from `DagBag.collect_dags` and from `tests_common.test_utils.db.parse_and_sync_to_db` when callers pass `include_examples=True`, and update the docstrings to direct callers at the `[core] load_examples` configuration option. - Migrate the in-tree `parse_and_sync_to_db(..., include_examples=True)` callers to use `conf_vars({("core", "load_examples"): "true"})`. - Update `test_task_instances.py` bundle-name assertions and the `test_get_all_bundle_names` suffix check for the new format. - Add `airflow-core/newsfragments/66161.significant.rst` covering the user-visible REST-API bundle-name change. The package-name -> module-path heuristic is kept for now; replacing it with an authoritative field on `ProviderInfo` is tracked in apache#66305. Signed-off-by: André Ahlert <andre@aex.partners>
Drops the include_examples knob from DagBag.__init__, BundleDagBag.__init__, collect_dags, and parse_and_sync_to_db. Example DAGs now come in exclusively through the per-provider example bundles registered when [core] load_examples is enabled. Removes the deprecation warnings, updates all callers across core, devel-common, and providers, and gates the few provider tests that still need to read examples on older Airflow versions behind AIRFLOW_V_3_3_PLUS. Signed-off-by: André Ahlert <andre@aex.partners>
|
@jscheffl resolved by removing the Drafted-by: Claude Code (Opus 4.7); reviewed by @andreahlert before posting @potiuk I've started using myself. Really super powers 👍🏻 |
Indeed. |
The previous commit removed include_examples from DagBag callers but left two follow-ups that broke prek mypy and the config CLI test: - devel-common test_utils/db.py: pre-3.1 compat branches still call DagBag(include_examples=False) for older Airflow runtimes; mypy now flags it because the current source no longer accepts that kwarg. Add call-arg type-ignore (matches the existing attr-defined pattern on the sync_to_db calls below). - test_config_command.py: assertions for conf.write(...) lost the include_examples=False kwarg, but conf.write is AirflowConfigParser.write (config-file examples), unrelated to the DagBag flag, and still passes it. Restore the kwarg in the expected call. Signed-off-by: André Ahlert <andre@aex.partners>
Signed-off-by: André Ahlert <andre@aex.partners>
…compat The merge with main brought in tests in airflow-core/tests/unit/models/test_dag.py that still passed include_examples=False to DagBag, and the dag_maker fixture in pytest_plugin.py was loading examples by default on Airflow <3.3 because the removed kwarg flipped the effective default to True. That broke compat 3.0.6 runs because the cleanup path then tried to roll back a session that was never attached. Drop the kwarg from the new test_dag.py callers and gate both dag_maker and get_test_dag DagBag construction on AIRFLOW_V_3_3_PLUS so older Airflow versions still pass include_examples=False explicitly. Signed-off-by: André Ahlert <andre@aex.partners>
The PR drops include_examples=False from a number of DagBag callers because DagBag in 3.3+ no longer accepts the kwarg and never loads examples by default. On compat runs against Airflow 3.1.x and 3.2.x, parse_and_sync_to_db hit a branch that called DagBag(folder) with the old default include_examples=True, which let the dags-folder DagBag pull in core example DAGs that the bundle loop also synced, producing duplicate dag_schedule_asset_reference rows and a UNIQUE constraint failure. On Airflow 2.11.1, DagBag callers in the standard external_task_sensor and time_delta tests defaulted to include_examples=True and pulled in example DAGs whose required Params lack defaults, breaking ExternalTaskMarker tests with ParamValidationError. Restrict the new bundle-aware sync path to AIRFLOW_V_3_3_PLUS and pass include_examples=False explicitly on older Airflow versions in the standard provider tests so compat 2.11.1, 3.1.8, and 3.2.1 keep their original behaviour. Signed-off-by: André Ahlert <andre@aex.partners>
|
@potiuk ready to another shot when you have the time :) |
|
Time ? What is free time ? :) |
I had a college teacher that used to answer "what do you do from midnight to 6am, when I ask that 😆 |
DagBag is not part of the Airflow public interface, so removing include_examples from DagBag/collect_dags/BundleDagBag/parse_and_sync_to_db does not need a release-notes migration recipe. Drop the internal removal block and the Types of change / Migration rules scaffolding, keep the bundle naming change and the bundle_name REST API behaviour change. Signed-off-by: André Ahlert <andre@aex.partners>
Picking up #57320 since it's been stuck and #52469 is blocking the rest of the Examples Refurbish epic. Most of the work here is @bhavaniravi's, just rebased on main and went through @jscheffl's review comments.
Main thing worth calling out: provider example DAGs are now discovered via
ProvidersManagerinstead of walkingairflow.providers.__path__, so nested providers likecommon-sqlactually get picked up. The rest is review fallout (removed the deadinclude_examplesblock in dagbag, dropped the two empty arangodb.sqlfiles, killed the global standard-provider import intest_dagbag, decoupled thetest_task_states_for_dag_runtest from the standard example DAGs).@jscheffl had asked @dstandish about why
sync_bundles_to_dbonly touches bundle metadata and not DAGs, no answer ever came back, so I just documented the actual contract on the method.closes: #52469
Was generative AI tooling used to co-author this PR?