
Add Coordinator Layer and Java Coordinator#65958

Draft
jason810496 wants to merge 51 commits into apache:main from
astronomer:task-sdk/feature/coordinator-interface

Conversation


@jason810496 jason810496 commented Apr 27, 2026

Add Coordinator Layer and Java Coordinator

  1. Add Java SDK #65956
  2. Add Coordinator Layer and Java Coordinator #65958 (this PR)
  3. Add CI, E2E Tests, and Pre-commit Hooks for Java SDK #65959
  • Try it out: A combined PoC branch with all changes cherry-picked is available at [DON'T MERGE] Java SDK All #65960 for reviewers who want to test the full integration end-to-end.

Why

Airflow's DAG file processor and task runner only understand Python. To run DAGs and tasks authored in other languages (Java now, Go/Rust later), both the parsing pipeline and the execution pipeline need a language-agnostic extension point that delegates to an external runtime subprocess.

How

The Coordinator Abstraction

A new BaseCoordinator base class in the Task SDK (task-sdk/src/airflow/sdk/execution_time/coordinator.py) defines the extension point. Language providers subclass it and implement three methods:

Method                                  Purpose
can_handle_dag_file(bundle_name, path)  File discovery (e.g., "is this a valid JAR that we can parse?")
dag_parsing_runtime_cmd(...)            Returns the subprocess command for DAG parsing
task_execution_runtime_cmd(...)         Returns the subprocess command for task execution
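As a rough sketch, a language provider's subclass might look like the following. This is illustrative only: the real `BaseCoordinator` signatures live in `coordinator.py` and may differ, and `DemoJavaCoordinator` with its command-line arguments is a hypothetical stand-in for the actual provider class.

```python
from pathlib import Path


class BaseCoordinator:
    """Simplified stand-in for the Task SDK base class (illustrative)."""

    def can_handle_dag_file(self, bundle_name: str, path: Path) -> bool:
        raise NotImplementedError

    def dag_parsing_runtime_cmd(self, path: Path) -> list[str]:
        raise NotImplementedError

    def task_execution_runtime_cmd(self, path: Path) -> list[str]:
        raise NotImplementedError


class DemoJavaCoordinator(BaseCoordinator):
    """Hypothetical coordinator that claims .jar files."""

    def can_handle_dag_file(self, bundle_name: str, path: Path) -> bool:
        # Real code would also inspect the JAR manifest, not just the suffix.
        return path.suffix == ".jar"

    def dag_parsing_runtime_cmd(self, path: Path) -> list[str]:
        return ["java", "-classpath", f"{path.parent}/*", "Main", "--parse"]

    def task_execution_runtime_cmd(self, path: Path) -> list[str]:
        return ["java", "-classpath", f"{path.parent}/*", "Main", "--run"]
```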

The base class owns the full subprocess lifecycle: TCP server creation, subprocess spawning, connection acceptance, and a selector-based byte-forwarding bridge between the Airflow supervisor (fd 0) and the language runtime (TCP socket). The shared I/O loop is extracted into selector_loop.py and reused by WatchedSubprocess.
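The forwarding bridge is, at its core, a selector loop that shuttles bytes from one endpoint to another until a side closes. A toy version of that pattern (not the actual `selector_loop.py` code, which multiplexes several streams and handles partial writes) might look like:

```python
import selectors
import socket


def bridge_bytes(src: socket.socket, dst: socket.socket) -> int:
    """Forward bytes from src to dst until src hits EOF; return bytes moved.

    Toy illustration of the selector-based forwarding pattern; the real
    bridge also relays the supervisor fd and the language runtime socket.
    """
    moved = 0
    sel = selectors.DefaultSelector()
    sel.register(src, selectors.EVENT_READ)
    try:
        while True:
            for _key, _events in sel.select():
                chunk = src.recv(4096)
                if not chunk:  # peer closed the connection
                    return moved
                dst.sendall(chunk)
                moved += len(chunk)
    finally:
        sel.close()
```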

Discovery and Routing

Providers register coordinators in provider.yaml under a new coordinators key. ProvidersManager (airflow-core) and ProvidersManagerTaskRuntime (task-sdk) both discover them:

  • DAG Parsing: DagFileProcessorProcess._resolve_processor_target() iterates registered coordinators — the first whose can_handle_dag_file() returns True handles the file.
  • Task Execution: task_runner._resolve_runtime_entrypoint() uses a two-step resolution: first it consults the [sdk] queue_to_sdk mapping (queue name to coordinator runtime name), then it falls back to matching DAG file extensions against registered coordinators.

Queue-Based Runtime Routing

Tasks are routed to non-Python runtimes via their queue assignment and a configuration mapping. Operators set queue="java-queue" (or any custom queue name), and the [sdk] queue_to_sdk config maps queue names to coordinator runtime names:

[sdk]
queue_to_sdk = {"java-queue": "java"}

This avoids adding new columns or API fields -- the existing queue field carries the routing signal from scheduling to execution, and the mapping is resolved at task execution time.

Java Provider

A new apache-airflow-providers-sdk-java provider implements JavaCoordinator:

  • can_handle_dag_file: checks if the file is a JAR with valid Airflow Java SDK manifest attributes
  • dag_parsing_runtime_cmd: constructs java -classpath <bundle>/* <MainClass> --comm=... --logs=...
  • task_execution_runtime_cmd: handles both pure Java DAGs (JAR path) and Python stub DAGs (resolves bundle from [java] bundles_folder config)
  • get_code_from_file: extracts embedded .java source from the JAR for Airflow UI display
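The JAR manifest check can be approximated with the standard library. The manifest attribute name below is an assumed placeholder; the real `JavaCoordinator` defines its own manifest keys:

```python
import zipfile


def looks_like_airflow_jar(path: str, marker: str = "Airflow-Java-SDK-Version") -> bool:
    """Return True if path is a JAR whose MANIFEST.MF contains marker.

    'Airflow-Java-SDK-Version' is an assumed attribute name for
    illustration only; the provider's BundleScanner defines the real keys.
    """
    if not path.endswith(".jar"):
        return False
    try:
        with zipfile.ZipFile(path) as jar:
            manifest = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
    except (OSError, KeyError, zipfile.BadZipFile):
        # Missing file, missing manifest entry, or not a ZIP at all.
        return False
    return marker in manifest
```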

What

Task SDK (task-sdk/)

  • Add BaseCoordinator abstract base class with full subprocess bridge lifecycle
  • Add selector_loop.py — shared selector-based I/O utilities, refactored out of supervisor.py
  • Add _resolve_runtime_entrypoint() to task_runner.py with queue-based and file-extension-based dispatch
  • Add QueueToCoordinatorMapper for resolving queue names to coordinators via [sdk] queue_to_sdk config
  • Extract resolve_bundle() helper for reuse by both Python and coordinator paths
  • Register coordinators discovery in ProvidersManagerTaskRuntime

Airflow Core (airflow-core/)

  • Add [sdk] queue_to_sdk configuration option for queue-to-runtime mapping
  • Extend DagFileProcessorProcess.start() with _resolve_processor_target() for coordinator delegation
  • Extend DagFileProcessorManager to recognize runtime file extensions (e.g., .jar) and skip ZIP inspection for them
  • Extend DagCode.get_code_from_file() to delegate to coordinator's get_code_from_file()
  • Add coordinators extension point to provider.yaml.schema.json and provider_info.schema.json
  • Register coordinators discovery in ProvidersManager

Java Provider (providers/sdk/java/)

  • Add JavaCoordinator with DAG parsing, task execution, and code extraction
  • Add BundleScanner for JAR manifest inspection and bundle resolution
  • Add provider.yaml with coordinators registration and [java] bundles_folder config
  • Add provider packaging (pyproject.toml, docs, LICENSE, NOTICE)
  • Add java_sdk_setup.sh for Breeze development environment

Was generative AI tooling used to co-author this PR?

Co-authored-by: Tzu-ping Chung uranusjr@gmail.com

@jason810496 jason810496 removed the backport-to-v3-2-test label Apr 27, 2026
@jason810496 jason810496 self-assigned this Apr 27, 2026
@uranusjr uranusjr added the AIP-108: java-sdk and AIP-108: Coordinator labels Apr 28, 2026
Comment thread task-sdk/src/airflow/sdk/definitions/mappedoperator.py Outdated
@jason810496 jason810496 force-pushed the task-sdk/feature/coordinator-interface branch from 688d569 to 59d5a47 Compare April 28, 2026 11:02
Multi-Language extras
=====================

These are extras that add dependencies needed for integration with other language runtimes. Currently we have only the Java SDK related extra, but in the future we might add more extras for other language runtimes.
Contributor

Go SDK would not be listed here?

Member Author

@jason810496 jason810496 Apr 29, 2026

Once the go-sdk adapts the coordinator interface as a provider, I will update the description here to avoid confusion.

Comment thread airflow-core/src/airflow/config_templates/config.yml Outdated
from airflow.providers_manager import ProvidersManager

extensions: list[str] = []
for coordinator_cls in ProvidersManager().coordinators:
Contributor

Would we assume that (if multiple) all Dag parsers load the language interpreters? I could well imagine spinning up one (or multiple) Dag parsers for Python and one additional for Java, then deploying the JAR and JDK only to the instances where needed... and on the Python Dag parser adding the GitSyncBundle... (which on the Java side is probably not used).

Not sure if everybody likes to deploy a JDK into each Dag parser environment

Member Author

The answer is similar to the #65956 (comment) comment.

Only if the dag-processor installs the target sdk.<lang> provider will DAG parsing be enabled for pure-<lang> DAGs.

Comment thread task-sdk/src/airflow/sdk/execution_time/workloads/task.py

def _start_server() -> socket.socket:
"""Create a TCP server socket bound to a random port on localhost."""
server = socket.socket()
Contributor

As in the other PR, I am sceptical that TCP sockets should be used, and I do not think it is a good idea to define a proprietary protocol.

Member

On the other hand, the Python implementation in use since 3.0 already uses the same mechanism. (It just creates the TCP sockets in another way.)

@jason810496 jason810496 force-pushed the task-sdk/feature/coordinator-interface branch 6 times, most recently from d2f28c8 to 52dcb2a Compare April 30, 2026 14:33
Member

@uranusjr uranusjr left a comment

Same for many other tests

],
)
@time_machine.travel("2025-01-01 00:00:00", tick=False)
@time_machine.travel(datetime(2025, 1, 1, 0, 0, 0, tzinfo=timezone.utc), tick=False)
Member

Suggested change
@time_machine.travel(datetime(2025, 1, 1, 0, 0, 0, tzinfo=timezone.utc), tick=False)
@time_machine.travel(datetime(2025, 1, 1, tzinfo=timezone.utc), tick=False)

jason810496 and others added 25 commits May 7, 2026 11:07
Tweak coordinator class names, attribute names, and method names to be
shorter and avoid the term 'runtime'.
- Remove Java SDK setup in Dockerfile
- add multi-language extras documentation
- Update TaskInstanceDTO description, and adjust API version in generated files
- Update JavaCoordinator to use TaskInstanceDTO
- add compatibility check for Airflow >= 3.3.0
- Updated the Airflow issue template to include 'sdk-java' as an option.
- Added unit tests for JavaCoordinator functionality.
- Created a new test file for Java bundle scanning.
- Updated uv.lock to reflect new dependency requirements for tomli.
Replace TaskInstance with TaskInstanceDTO in StartupDetails fixtures
and add the required pool_slots, queue, and priority_weight fields.
DagCode.get_code_from_file probes every coordinator's can_handle_dag_file
on each fileloc, including .py paths nested inside ZIP DAGs (e.g.
test_zip.zip/test_zip.py). The Java coordinator opened these as JAR
files, raising NotADirectoryError because the parent path is a ZIP file
rather than a directory. Short-circuit on the .jar suffix and add
NotADirectoryError to the suppressed exceptions for safety.
The config.yml description duplicated the example field as a literal
"Example:" line in the description text. With --include-descriptions
this rendered as "# Example:", which trips
test_cli_show_config_shows_descriptions. The example is already in the
dedicated example field, so remove the duplicate from the description.
apache-airflow-providers-sdk-java requires apache-airflow>=3.3.0, so
installing it against the 2.11.1 / 3.0.6 / 3.1.8 / 3.2.1 compat
targets fails dependency resolution. Add it to remove-providers for
each older-Airflow row in PROVIDERS_COMPATIBILITY_TESTS_MATRIX.

Also silence mypy no-redef on dev/registry tomli fallback imports,
which now trip the mypy-dev hook because tomli is resolvable in the
mypy environment after recent uv.lock updates.
Import TaskInstanceDTO from the same airflow.sdk._shared.workloads
namespace that BaseCoordinator uses. The previous import via
airflow._shared.workloads pointed at the same physical file via a
symlink but mypy treated the two namespaces as distinct types,
flagging the override as a Liskov violation.
* Add 'sdk' to empty_subpackages in provider_conf so the autoapi-
  generated _api/airflow/providers/sdk/index.rst is excluded the
  same way the other namespace-only directories are. Without this,
  Sphinx warned that the document was not in any toctree.
* Fix the relative include paths in security.rst and installing-
  providers-from-sources.rst. Nested providers (those under a
  namespace package like sdk/) sit one directory deeper than
  flat providers, so the include needs four ../ segments instead
  of three to reach devel-common/src/sphinx_exts/includes/.
- Removed the shared workloads dependency from pyproject.toml and related files.
- Deleted the workloads directory and its references in the codebase.
- Refactored imports of TaskInstanceDTO to point to the new location in execution_time.workloads.task.
- Introduced new files for TaskInstanceDTO and its base class in the execution_time module.
- Updated tests to reflect the changes in TaskInstanceDTO imports.
@jason810496 jason810496 force-pushed the task-sdk/feature/coordinator-interface branch from b1125a1 to 3d719d1 Compare May 7, 2026 03:07
@jason810496 jason810496 changed the title Add Coordinator Layer and Java Provider Add Coordinator Layer and Java Coordinator May 7, 2026
…ample

Add JavaCoordinator, jvm, openjdk, Xmx to the docs spelling wordlist so
the rendered configurations-ref doesn't fail Sphinx spellcheck on the
[sdk] coordinators example. Also indent multi-line example/default
values by 8 spaces in the shared sections-and-options template so the
rendered RST code-block keeps consistent indentation and doesn't
break the field list.

Labels

AIP-108: Coordinator, AIP-108: java-sdk, area:ConfigTemplates, area:DAG-processing, area:dev-tools, area:Executors-core, area:providers, area:task-sdk, kind:documentation, provider:standard

6 participants