Add Scala Spark ETL example to Java-SDK e2e tests #68939
Open
jason810496 wants to merge 4 commits into
Open
Conversation
## Why Demonstrate and regression-test that the Java SDK can run a real Scala + Apache Spark workload, with task logs routed into Airflow via Log4j 2. ## What - Add `java-sdk/scala_spark_example`: a standalone Scala + Spark 3.5 (local mode) ETL bundle whose three tasks pass scalar results over XCom and log through Log4j 2 (`airflow-sdk-log4j2`). - Run it inside the existing `java_sdk` e2e via a second coordinator and queue (`scala-jdk` / `scala`) with its own `jars_root`, keeping the Java example bundle Spark-free. - Pin the e2e worker JRE to Java 17 and pass Spark's `--add-opens` JVM args. - Add `TestJavaSDKScalaSparkExample` asserting the tasks succeed and the XComs match the fixed dataset (5 rows, total revenue 1000).
phanikumv
reviewed
Jun 25, 2026
The e2e test asserted only the extract and load XComs, so the aggregation stage in the middle of the pipeline could regress without the test noticing. Assert its XCom as well, drop the unused dataset constants, and note why the transform reads the upstream count it does not reuse.
The Scala Spark coordinator launched the bundle JVM with a hand-curated subset of Spark's Java 17 module openings. Spark normally injects its full default set through its own launcher, which the raw JavaCoordinator launch bypasses. The subset is enough for the toy aggregation but omits openings that real Spark code paths need (Kryo reflection, off-heap cleaner, charset decoding, Kerberos), so the example would mislead anyone copying it for a non-trivial Spark workload. Mirror Spark 3.5.8's full default module option set instead.
uranusjr
reviewed
Jun 26, 2026
| ^java-sdk/gradlew$| | ||
| ^java-sdk/gradlew\.bat$| | ||
| ^java-sdk/gradle| | ||
| ^java-sdk/scala_spark_example/src/scala/org/apache/airflow/example/ScalaSparkExample\.scala$| |
Member
There was a problem hiding this comment.
Which word in the file triggers this?
Member
Author
There was a problem hiding this comment.
The master("local[1]") call triggers the inclusive-language.
Member
|
We should proably restructure the example project layout to avoid too many free directories lying around. I’ll work on that after this is merged. |
Member
Author
Sure, I explicitly split the scala one out of existing example project as the spark scala one will add unnecessary scala deps to the existing one. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Demonstrate and regression-test that the Java SDK can run a real Scala + Apache Spark workload, with task logs routed into Airflow via Log4j 2.
What
java-sdk/scala_spark_example: a standalone Scala + Spark 3.5 (local mode) ETL bundle whose three tasks pass scalar results over XCom and log through Log4j 2 (airflow-sdk-log4j2).java_sdke2e via a second coordinator and queue (scala-jdk/scala) with its ownjars_root, keeping the Java example bundle Spark-free.--add-opensJVM args.TestJavaSDKScalaSparkExampleasserting the tasks succeed and the XComs match the fixed dataset (5 rows, total revenue 1000).Was generative AI tooling used to co-author this PR?