Skip to content

Load example DAGs from providers via ProvidersManager (continuation of #57320)#66161

Open
andreahlert wants to merge 18 commits into
apache:mainfrom
andreahlert:pr-57320
Open

Load example DAGs from providers via ProvidersManager (continuation of #57320)#66161
andreahlert wants to merge 18 commits into
apache:mainfrom
andreahlert:pr-57320

Conversation

@andreahlert
Copy link
Copy Markdown
Contributor

Picking up #57320 since it's been stuck and #52469 is blocking the rest of the Examples Refurbish epic. Most of the work here is @bhavaniravi's, just rebased on main and went through @jscheffl's review comments.

Main thing worth calling out: provider example DAGs are now discovered via ProvidersManager instead of walking airflow.providers.__path__, so nested providers like common-sql actually get picked up. The rest is review fallout (removed the dead include_examples block in dagbag, dropped the two empty arangodb .sql files, killed the global standard-provider import in test_dagbag, decoupled the test_task_states_for_dag_run test from the standard example DAGs).

@jscheffl had asked @dstandish about why sync_bundles_to_db only touches bundle metadata and not DAGs, no answer ever came back, so I just documented the actual contract on the method.

closes: #52469


Was generative AI tooling used to co-author this PR?
  • Yes

bhavaniravi and others added 6 commits April 30, 2026 09:35
1. loading example dags as bundles
2. fixing testcases to use bundles instead of dagbag folders
3. fixing testcases to use example_dags from standard module

Signed-off-by: André Ahlert <andre@aex.partners>
Signed-off-by: André Ahlert <andre@aex.partners>
Replace the directory walk over airflow.providers.__path__ with a
lookup based on ProvidersManager. The previous approach silently
skipped:

- nested providers like apache-airflow-providers-common-sql, whose
  module path is airflow.providers.common.sql (one level deeper);
- providers installed outside the airflow.providers namespace
  package, since they are not visible via os.listdir.

The new implementation iterates over the providers registered through
the apache_airflow_provider entry point, imports each provider module
and adds its example_dags folder when present. Bundle names are now
keyed on the canonical package name to keep them unique and stable
across deployments.

Signed-off-by: André Ahlert <andre@aex.partners>
The set of provider example DAG bundles depends on which providers
expose an example_dags folder, which is environment specific. Pin
only the built-in bundles and assert the prefix of any extra entry.

Signed-off-by: André Ahlert <andre@aex.partners>
Apply remaining suggestions from the original review of the closed
PR apache#57320:

- bundles/manager.py: document that sync_bundles_to_db only reconciles
  bundle metadata and does not parse or write DAG rows, clarifying the
  signature confusion raised by @bhavaniravi during review.
- tests/unit/dag_processing/test_dagbag.py: stop importing the standard
  provider's example_dags module at collection time. Move the import
  into a small helper plus a pytest fixture so tests that need the
  folder request it explicitly and the module remains collectable when
  the standard provider is not yet importable. The single parametrize
  case that referenced the folder now passes a relative file name and
  resolves the absolute path inside the test.
- tests/unit/cli/commands/test_task_command.py: build a minimal DAG
  inline in test_task_states_for_dag_run instead of importing one from
  the standard provider's example_dags. The test only checks CLI
  behaviour around a known dag_id/task_id, so reproducing the name and
  a single task is enough to keep the core test decoupled from the
  standard provider's example DAGs.

Signed-off-by: André Ahlert <andre@aex.partners>
The teardown deletes a single DagBundleModel row keyed by name. Use
session.execute(delete(...).where(...)) instead of the deprecated
session.query(...).filter_by(...).delete() form so prek's
prevent-deprecated-sqlalchemy-usage hook stays clean.

Signed-off-by: André Ahlert <andre@aex.partners>
@andreahlert
Copy link
Copy Markdown
Contributor Author

@jscheffl if you could take a look.

@jscheffl
Copy link
Copy Markdown
Contributor

Cool! Thanks for picking-up the PR started by @bhavaniravi - hope we can get this to main in the next release then! Sorry also lost focus on this. Will attempt to take a look on the (extended) weekend and will respond ASAP!

@andreahlert
Copy link
Copy Markdown
Contributor Author

Cool! Thanks for picking-up the PR started by @bhavaniravi - hope we can get this to main in the next release then! Sorry also lost focus on this. Will attempt to take a look on the (extended) weekend and will respond ASAP!

Thanks! I am Picking up a few related issues/PRs from this epic in parallel.

After moving the standard provider's example DAGs into their own
bundle, example_python_operator no longer lives in the dags-folder
bundle. Update the 24 expected payloads in test_task_instances.py
that asserted bundle_name='dags-folder' to the new
airflow-provider-apache-airflow-providers-standard-example-dags
bundle name. The lone sync_bag_to_db('dags-folder', ...) call in this
file is unrelated; it registers a synthetic dag built by dag_maker.

Signed-off-by: André Ahlert <andre@aex.partners>
On Airflow 3.1+, parse_and_sync_to_db iterates every registered DAG
bundle (including example_dags and airflow-provider-*-example-dags)
and syncs them with their own bundle name. The leading
DagBag(dag_folder=folder, include_examples=True) was also pulling
example DAGs into the dags-folder bundle, so each example DAG ended
up registered under two different bundles. The duplicated rows then
violated the (asset_id, dag_id) unique constraint on
dag_schedule_asset_reference and broke unrelated tests (notably the
fab provider tests under compat).

Force include_examples=False on the 3.1+ path; the bundle loop is
already responsible for loading example DAGs from their own bundles.
The 2.x and 3.0 paths are untouched.

Signed-off-by: André Ahlert <andre@aex.partners>
Signed-off-by: André Ahlert <andre@aex.partners>
@jscheffl jscheffl added this to the Airflow 3.3.0 milestone May 1, 2026
Copy link
Copy Markdown
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for driving the stale PR forward! Looks good to me! Just a few nit points.

Request another maintainer for another view but thing this is very good! A step forward in provider modularity!

Comment thread airflow-core/src/airflow/dag_processing/bundles/manager.py Outdated
Comment thread airflow-core/src/airflow/dag_processing/bundles/manager.py Outdated
Comment thread devel-common/src/tests_common/test_utils/db.py Outdated
andreahlert and others added 2 commits May 2, 2026 03:07
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
@andreahlert andreahlert requested a review from jscheffl May 2, 2026 06:28
@jscheffl jscheffl requested review from eladkal, kaxil and potiuk May 2, 2026 09:27
Copy link
Copy Markdown
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for me, leaving this PR open for a moment for another pair of eyes. Also another opinion on "making provider discovery proper in this PR" would be good. For me it is OK like it is.

Copy link
Copy Markdown
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up — the underlying change is right, and unblocking the Examples Refurbish epic is worth getting in. Substantive comments inline; high-level summary:

  • the include_examples parameter on DagBag and on parse_and_sync_to_db is now a silent no-op (footgun);
  • the package-name → module derivation in _add_provider_example_dags_to_bundle is the "proper discovery" concern @jscheffl flagged — there's a robust alternative via the existing entry-point module recorded in discover_all_providers_from_packages;
  • the bundle name airflow-provider-apache-airflow-providers-standard-example-dags is persisted in DB rows and emitted in REST responses — please pin the format down now rather than after release.

Two things worth keeping as-is:

  • the sync_bundles_to_db docstring you added is exactly the contract that was missing — thank you (and it carries the answer @dstandish never gave);
  • the parse_and_sync_to_db comment about the dag_schedule_asset_reference collision is excellent — non-obvious failure mode that would otherwise have been rediscovered the hard way.

One repo-level item not anchorable to a line: this is an airflow-core change with a user-visible REST-API bundle-name change, so per AGENTS.md it should ship with a newsfragment in airflow-core/newsfragments/66161.significant.rst.

Holding my approval until at minimum the silent include_examples no-op is addressed; happy to follow up on the rest in a tracking issue if that's preferred.


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

Comment thread airflow-core/src/airflow/dag_processing/bundles/manager.py Outdated
Comment thread airflow-core/src/airflow/dag_processing/bundles/manager.py
Comment thread airflow-core/src/airflow/dag_processing/bundles/manager.py
Comment thread airflow-core/src/airflow/dag_processing/bundles/manager.py Outdated
Comment thread airflow-core/src/airflow/dag_processing/bundles/manager.py Outdated
Comment thread airflow-core/src/airflow/dag_processing/dagbag.py
Comment thread devel-common/src/tests_common/test_utils/db.py Outdated
@andreahlert
Copy link
Copy Markdown
Contributor Author

Thanks for picking this up — the underlying change is right, and unblocking the Examples Refurbish epic is worth getting in. Substantive comments inline; high-level summary:

  • the include_examples parameter on DagBag and on parse_and_sync_to_db is now a silent no-op (footgun);
  • the package-name → module derivation in _add_provider_example_dags_to_bundle is the "proper discovery" concern @jscheffl flagged — there's a robust alternative via the existing entry-point module recorded in discover_all_providers_from_packages;
  • the bundle name airflow-provider-apache-airflow-providers-standard-example-dags is persisted in DB rows and emitted in REST responses — please pin the format down now rather than after release.

Two things worth keeping as-is:

  • the sync_bundles_to_db docstring you added is exactly the contract that was missing — thank you (and it carries the answer @dstandish never gave);
  • the parse_and_sync_to_db comment about the dag_schedule_asset_reference collision is excellent — non-obvious failure mode that would otherwise have been rediscovered the hard way.

One repo-level item not anchorable to a line: this is an airflow-core change with a user-visible REST-API bundle-name change, so per AGENTS.md it should ship with a newsfragment in airflow-core/newsfragments/66161.significant.rst.

Holding my approval until at minimum the silent include_examples no-op is addressed; happy to follow up on the rest in a tracking issue if that's preferred.

Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

Thanks for reviewing. I really need to start using Claude to have a deep review. I think that would help my deliverables.

- Hoist `importlib`, `logging`, and `ProvidersManager` to module top in
  `dag_processing/bundles/manager.py` and add a module-level logger.
- Reverse the `apache-airflow-providers-` prefix check so the canonical
  case reads first, per @jscheffl's nit.
- Broaden the exception handler around `import_module` and the
  subsequent `__path__` access to `Exception` with `log.exception`, so a
  provider with a custom `__getattr__` can no longer crash config
  loading on the scheduler/api-server.
- Switch the provider-example bundle dedup key from `bundle_name` to
  the resolved `example_dag_folder` and document the namespace-package
  scenario the guard exists for (multiple `airflow.providers.common.*`
  distributions sharing one namespace via `pkgutil.extend_path`).
- Rename the per-provider bundle from
  `airflow-provider-{package}-example-dags` to
  `{package}-example-dags`, matching the package distribution name
  surfaced to users in REST API responses and `pip list`.
- Emit a `DeprecationWarning` from `DagBag.collect_dags` and from
  `tests_common.test_utils.db.parse_and_sync_to_db` when callers pass
  `include_examples=True`, and update the docstrings to direct callers
  at the `[core] load_examples` configuration option.
- Migrate the in-tree `parse_and_sync_to_db(..., include_examples=True)`
  callers to use `conf_vars({("core", "load_examples"): "true"})`.
- Update `test_task_instances.py` bundle-name assertions and the
  `test_get_all_bundle_names` suffix check for the new format.
- Add `airflow-core/newsfragments/66161.significant.rst` covering the
  user-visible REST-API bundle-name change.

The package-name -> module-path heuristic is kept for now; replacing it
with an authoritative field on `ProviderInfo` is tracked in apache#66305.

Signed-off-by: André Ahlert <andre@aex.partners>
@andreahlert andreahlert requested a review from vincbeck as a code owner May 3, 2026 13:01
Drops the include_examples knob from DagBag.__init__, BundleDagBag.__init__,
collect_dags, and parse_and_sync_to_db. Example DAGs now come in exclusively
through the per-provider example bundles registered when [core] load_examples
is enabled. Removes the deprecation warnings, updates all callers across core,
devel-common, and providers, and gates the few provider tests that still need
to read examples on older Airflow versions behind AIRFLOW_V_3_3_PLUS.

Signed-off-by: André Ahlert <andre@aex.partners>
@andreahlert
Copy link
Copy Markdown
Contributor Author

andreahlert commented May 4, 2026

@jscheffl resolved by removing the include_examples parameter entirely from DagBag.__init__, BundleDagBag.__init__, collect_dags, and parse_and_sync_to_db (commit 0b18521). Example DAG loading is now controlled exclusively by [core] load_examples via the per-provider bundles. Updated all callers across core tests, devel-common, and providers; the three standard-provider sensor tests that still need examples on pre-3.3 Airflow are gated behind AIRFLOW_V_3_3_PLUS (with one stale gate cleaned up where the class was already version-skipped). Newsfragment updated to describe the removal under "Removals:".


Drafted-by: Claude Code (Opus 4.7); reviewed by @andreahlert before posting

@potiuk I've started using myself. Really super powers 👍🏻

@andreahlert andreahlert requested a review from potiuk May 4, 2026 13:02
@potiuk
Copy link
Copy Markdown
Member

potiuk commented May 4, 2026

@potiuk I've started using myself. Really super powers 👍🏻

Indeed.

The previous commit removed include_examples from DagBag callers but left
two follow-ups that broke prek mypy and the config CLI test:

- devel-common test_utils/db.py: pre-3.1 compat branches still call
  DagBag(include_examples=False) for older Airflow runtimes; mypy now
  flags it because the current source no longer accepts that kwarg. Add
  call-arg type-ignore (matches the existing attr-defined pattern on the
  sync_to_db calls below).
- test_config_command.py: assertions for conf.write(...) lost the
  include_examples=False kwarg, but conf.write is AirflowConfigParser.write
  (config-file examples), unrelated to the DagBag flag, and still passes
  it. Restore the kwarg in the expected call.

Signed-off-by: André Ahlert <andre@aex.partners>
@andreahlert andreahlert closed this May 4, 2026
@andreahlert andreahlert reopened this May 4, 2026
Signed-off-by: André Ahlert <andre@aex.partners>
…compat

The merge with main brought in tests in airflow-core/tests/unit/models/test_dag.py
that still passed include_examples=False to DagBag, and the dag_maker fixture in
pytest_plugin.py was loading examples by default on Airflow <3.3 because the
removed kwarg flipped the effective default to True. That broke compat 3.0.6
runs because the cleanup path then tried to roll back a session that was never
attached. Drop the kwarg from the new test_dag.py callers and gate both
dag_maker and get_test_dag DagBag construction on AIRFLOW_V_3_3_PLUS so older
Airflow versions still pass include_examples=False explicitly.

Signed-off-by: André Ahlert <andre@aex.partners>
The PR drops include_examples=False from a number of DagBag callers because
DagBag in 3.3+ no longer accepts the kwarg and never loads examples by default.
On compat runs against Airflow 3.1.x and 3.2.x, parse_and_sync_to_db hit a
branch that called DagBag(folder) with the old default include_examples=True,
which let the dags-folder DagBag pull in core example DAGs that the bundle
loop also synced, producing duplicate dag_schedule_asset_reference rows and a
UNIQUE constraint failure. On Airflow 2.11.1, DagBag callers in the standard
external_task_sensor and time_delta tests defaulted to include_examples=True
and pulled in example DAGs whose required Params lack defaults, breaking
ExternalTaskMarker tests with ParamValidationError.

Restrict the new bundle-aware sync path to AIRFLOW_V_3_3_PLUS and pass
include_examples=False explicitly on older Airflow versions in the standard
provider tests so compat 2.11.1, 3.1.8, and 3.2.1 keep their original
behaviour.

Signed-off-by: André Ahlert <andre@aex.partners>
@andreahlert
Copy link
Copy Markdown
Contributor Author

@potiuk ready to another shot when you have the time :)

@potiuk
Copy link
Copy Markdown
Member

potiuk commented May 5, 2026

Time ? What is free time ? :)

@andreahlert
Copy link
Copy Markdown
Contributor Author

andreahlert commented May 5, 2026

Time ? What is free time ? :)

I had a college teacher that used to answer "what do you do from midnight to 6am, when I ask that 😆

Comment thread airflow-core/newsfragments/66161.significant.rst Outdated
DagBag is not part of the Airflow public interface, so removing
include_examples from DagBag/collect_dags/BundleDagBag/parse_and_sync_to_db
does not need a release-notes migration recipe. Drop the internal removal
block and the Types of change / Migration rules scaffolding, keep the
bundle naming change and the bundle_name REST API behaviour change.

Signed-off-by: André Ahlert <andre@aex.partners>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Load example Dags from standard provider

4 participants