Add Scala Spark ETL example to Java-SDK e2e tests #68939

Open
jason810496 wants to merge 4 commits into apache:main from jason810496:ci/java-sdk/add-scala-spark-example
@jason810496

Member

Why

Demonstrate and regression-test that the Java SDK can run a real Scala + Apache Spark workload, with task logs routed into Airflow via Log4j 2.

What

  • Add java-sdk/scala_spark_example: a standalone Scala + Spark 3.5 (local mode) ETL bundle whose three tasks pass scalar results over XCom and log through Log4j 2 (airflow-sdk-log4j2).
  • Run it inside the existing java_sdk e2e via a second coordinator and queue (scala-jdk / scala) with its own jars_root, keeping the Java example bundle Spark-free.
  • Pin the e2e worker JRE to Java 17 and pass Spark's --add-opens JVM args.
  • Add TestJavaSDKScalaSparkExample asserting the tasks succeed and the XComs match the fixed dataset (5 rows, total revenue 1000).

Was generative AI tooling used to co-author this PR?

@jason810496 jason810496 self-assigned this Jun 24, 2026
@jason810496 jason810496 added the AIP-108: java-sdk label and removed the backport-to-v3-3-test label Jun 24, 2026
The e2e test asserted only the extract and load XComs, so the aggregation
stage in the middle of the pipeline could regress without the test noticing.
Assert its XCom as well, drop the unused dataset constants, and note why the
transform reads the upstream count it does not reuse.

The Scala Spark coordinator launched the bundle JVM with a hand-curated
subset of Spark's Java 17 module openings. Spark normally injects its full
default set through its own launcher, which the raw JavaCoordinator launch
bypasses. The subset is enough for the toy aggregation but omits openings
that real Spark code paths need (Kryo reflection, off-heap cleaner, charset
decoding, Kerberos), so the example would mislead anyone copying it for a
non-trivial Spark workload. Mirror Spark 3.5.8's full default module option
set instead.
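
To make the launch change concrete, here is a minimal sketch of prepending Spark's module options to a raw `java` command line. The option list is recalled from Spark's `JavaModuleOptions` class and may not match 3.5.8 exactly, so verify it against the Spark release you pin. The `SparkModuleOptions` class, the `buildCommand` helper, and the classpath and main-class values are hypothetical and are not the coordinator's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class SparkModuleOptions {
    // Assumption: recalled from Spark's JavaModuleOptions; check the Spark source
    // for the exact set shipped with the version you use.
    static final List<String> DEFAULT_MODULE_OPTIONS = List.of(
        "-XX:+IgnoreUnrecognizedVMOptions",
        "--add-opens=java.base/java.lang=ALL-UNNAMED",
        "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED",
        "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED",
        "--add-opens=java.base/java.io=ALL-UNNAMED",
        "--add-opens=java.base/java.net=ALL-UNNAMED",
        "--add-opens=java.base/java.nio=ALL-UNNAMED",
        "--add-opens=java.base/java.util=ALL-UNNAMED",
        "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED",
        "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED",
        "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED",
        "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
        "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED",
        "--add-opens=java.base/sun.security.action=ALL-UNNAMED",
        "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED",
        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED",
        "-Djdk.reflect.useDirectMethodHandle=false"
    );

    // Hypothetical helper: builds "java <module options> -cp <classpath> <mainClass>".
    static List<String> buildCommand(String javaBin, String classpath, String mainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.addAll(DEFAULT_MODULE_OPTIONS);
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder classpath and main class, for illustration only.
        List<String> cmd = buildCommand("java", "bundle.jar", "com.example.BundleMain");
        System.out.println(String.join(" ", cmd));
    }
}
```

Keeping the options in one list makes it easier to compare them against Spark's own defaults when the pinned Spark version changes.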
@jason810496 jason810496 marked this pull request as ready for review June 25, 2026 07:33
@jason810496 jason810496 requested a review from uranusjr June 25, 2026 07:33
Comment thread .pre-commit-config.yaml
^java-sdk/gradlew$|
^java-sdk/gradlew\.bat$|
^java-sdk/gradle|
^java-sdk/scala_spark_example/src/scala/org/apache/airflow/example/ScalaSparkExample\.scala$|

Member

Which word in the file triggers this?

Member Author

The master("local[1]") call triggers the inclusive-language check.
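
A word-based lint cannot tell Spark's `master` API method apart from prose, which is why the file needs an exclusion. The sketch below illustrates this with a regex scan. The term list, the `InclusiveLanguageScan` class, and the `flaggedTerms` helper are hypothetical and do not reproduce the actual pre-commit hook's pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InclusiveLanguageScan {
    // Hypothetical term list; the real hook's pattern is not shown in this thread.
    static final Pattern FLAGGED = Pattern.compile(
        "\\b(master|slave|whitelist|blacklist)\\b", Pattern.CASE_INSENSITIVE);

    // Returns every flagged term found in the line, in order of appearance.
    static List<String> flaggedTerms(String line) {
        List<String> hits = new ArrayList<>();
        Matcher m = FLAGGED.matcher(line);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sparkLine = "SparkSession.builder().master(\"local[1]\")";
        // The API method name matches as a whole word, so the line is flagged.
        System.out.println(flaggedTerms(sparkLine)); // prints [master]
    }
}
```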

@uranusjr

Member

We should probably restructure the example project layout to avoid too many free directories lying around. I’ll work on that after this is merged.

@jason810496

Member Author

We should probably restructure the example project layout to avoid too many free directories lying around. I’ll work on that after this is merged.

Sure. I deliberately split the Scala example out of the existing example project because the Spark Scala example would add unnecessary Scala dependencies to it.

Labels

AIP-108: java-sdk, area:dev-tools

3 participants