1. Executive Summary
This proposal outlines the roadmap for RayDP 2.0, a major architectural upgrade designed to support the next generation of big data infrastructure. The primary goal is to migrate the core runtime to Apache Spark 4.1.1, enforcing Java 17 and Scala 2.13. This modernization addresses the deprecation of older JVMs in the Spark ecosystem and aligns RayDP with the performance and security standards of 2025+.
2. Motivation
- Spark 4.0 Paradigm Shift: Apache Spark 4.0 has officially dropped support for Scala 2.12 and Java 8/11. To ensure RayDP remains compatible with modern data stacks, we must align with these upstream requirements.
- Performance & Security: Moving to Java 17 (LTS) unlocks significant garbage collection improvements (ZGC), better container awareness, and enhanced security features required by enterprise deployments.
- Technical Debt Resolution: The previous build system relied on legacy setup.py behaviors and Maven plugins incompatible with the Java Platform Module System (JPMS). This overhaul modernizes the entire build chain.
3. Technical Specification
3.1 Core Dependency Matrix
RayDP 2.0 introduces a strict upgrade to the dependency matrix.
| Component | RayDP 1.x (Legacy) | RayDP 2.0 (New) | Impact |
|-----------|--------------------|--------------------|--------|
| Java | 8 / 11 | 17 (Strict) | Requires environment updates on all Ray nodes. |
| Spark | 3.x | 4.1.1 | Major internal API drift (Scheduler, SQL, RPC). |
| Scala | 2.12 | 2.13 | Incompatible binary interface; requires recompilation. |
| Ray | 2.x | 2.53.0+ | Aligned with recent Ray releases. |
| Python | 3.6+ | 3.10+ | Dropped support for EOL Python versions. |
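The minimum-version constraints in the matrix above can be enforced programmatically at startup. The sketch below is a hypothetical helper for illustration, not part of the RayDP API; note that the naive parser ignores pre-release suffixes.

```python
# Hypothetical startup check against the RayDP 2.0 dependency matrix.
MINIMUMS = {"pyspark": "4.1.1", "ray": "2.53.0"}

def parse_version(v: str) -> tuple:
    """Parse a dotted version string like '4.1.1' into a comparable tuple.
    Pre-release tags (rc1, dev0, ...) are dropped by the isdigit() filter."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def meets_minimum(installed: str, required: str) -> bool:
    """True if the installed version satisfies the required minimum."""
    return parse_version(installed) >= parse_version(required)
```

For example, `meets_minimum("2.9.0", MINIMUMS["ray"])` is False because tuple comparison correctly treats 2.9 as older than 2.53, where a plain string comparison would not.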
3.2 Architectural Changes
A. The Shim Layer Strategy
Spark 4.1 introduces aggressive encapsulation and API removals. To handle this, we have restructured the core/shims module:
- Legacy Shims Disabled: Modules for Spark 3.2–3.5 are disabled in the build reactor.
- New Shim (raydp-shims-spark411): A dedicated module implementing the SparkShims interface for the 4.1.x line.
- Package-Private Accessors: Due to stricter visibility in Spark 4.x, we introduced helper classes in the org.apache.spark namespace:
  - Spark411Helper: Handles TaskContextImpl instantiation and CoarseGrainedExecutorBackend creation.
  - Spark411SQLHelper: Bridges ArrowConverters for toDataFrame operations, handling the new signature requirements for Arrow batch conversion.
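The shim-selection logic can be pictured as a version-to-module mapping. The module name below follows the naming in this proposal, but the dispatch function itself is a simplified sketch, not the actual loader.

```python
# Simplified sketch of shim dispatch: map a Spark major.minor version
# to the shim module implementing the SparkShims interface for that line.
SHIM_MODULES = {
    # Spark 3.2-3.5 shim modules exist but are disabled in the 2.0 build.
    "4.1": "raydp-shims-spark411",
}

def select_shim(spark_version: str) -> str:
    """Pick the shim module for a given Spark version (major.minor match)."""
    major_minor = ".".join(spark_version.split(".")[:2])
    try:
        return SHIM_MODULES[major_minor]
    except KeyError:
        raise RuntimeError(
            f"No shim available for Spark {spark_version}; "
            "RayDP 2.0 supports only the 4.1.x line."
        )
```

Failing fast with an explicit error here is preferable to the opaque linkage errors that otherwise surface deep inside the JVM at task launch.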
B. Scheduler Backend Refactoring
The RayCoarseGrainedSchedulerBackend has been significantly refactored:
- Removed Deprecated APIs: Calls to Utils.sparkJavaOpts (removed in Spark 4.x) have been replaced with direct configuration handling.
- Constructor Updates: Adapted to the new HadoopDelegationTokenManager and CoarseGrainedSchedulerBackend signatures.
- Scala 2.13 Compliance: Migrated all collection conversions from JavaConverters to CollectionConverters and handled explicit boxing for primitive types.
3.3 Build System Overhaul
- PEP 517 Compliance: The Python build system now uses pyproject.toml as the source of truth for dependencies and metadata.
- Editable Installs: setup.py was rewritten to support pip install -e . by correctly copying compiled JARs from core/ into the raydp/jars source tree during development. This fixes runtime IndexError and ClassNotFoundException issues during local testing.
- Maven Modernization: Updated maven-compiler-plugin, scala-maven-plugin, and scalatest to versions compatible with Java 17 modules.
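The editable-install fix amounts to a small build step that mirrors Maven output into the Python package. The helper below is a hypothetical illustration of the approach, not the actual setup.py code; directory names follow the layout described above.

```python
import shutil
from pathlib import Path

def copy_spark_jars(core_dir: str, jars_dir: str) -> list:
    """Copy compiled JARs from the Maven output tree under core_dir into
    the Python source tree, so `pip install -e .` finds them at runtime."""
    dest = Path(jars_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in Path(core_dir).rglob("*.jar"):
        shutil.copy2(jar, dest / jar.name)  # preserve timestamps
        copied.append(jar.name)
    return sorted(copied)
```

Running this during the build (rather than at import time) keeps an editable checkout behaving identically to a wheel install.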
4. Breaking Changes
Users upgrading to RayDP 2.0 must be aware of the following:
- Java 8/11 is no longer supported. Attempting to run RayDP 2.0 on older JVMs will result in UnsupportedClassVersionError.
- Spark 3.x is no longer supported. The codebase is not backward compatible.
- Scala 2.12 is no longer supported. Custom UDFs or libraries compiled against Scala 2.12 will cause binary incompatibility errors.
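The UnsupportedClassVersionError case is straightforward to diagnose: every JVM class file records the bytecode version in its header (major version 61 corresponds to Java 17, 52 to Java 8). As a hypothetical diagnostic, a few header bytes are enough:

```python
import struct

# Class-file major versions for common Java releases (subset).
MAJOR_TO_JAVA = {52: 8, 55: 11, 61: 17, 65: 21}

def class_file_java_version(header: bytes) -> int:
    """Read the Java release a .class file was compiled for.
    Header layout: magic (4 bytes), minor (2), major (2), big-endian."""
    magic, _minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    return MAJOR_TO_JAVA.get(major, major)
```

A RayDP 2.0 JAR built for Java 17 reports major version 61, which a Java 8/11 JVM refuses to load.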
5. Verification Plan
We have established a rigorous verification suite (examples/test_raydp_sanity.py) to validate the migration:
- [x] Cluster Initialization: Verifies that RayDPSparkMaster and RayDPExecutor actors start successfully on Ray.
- [x] Network Binding: Validates that the Spark driver and executors can communicate, specifically addressing macOS split-brain networking (localhost vs. LAN IP) via explicit binding.
- [x] Job Execution: Runs standard Spark actions (count, range) to verify the RPC layer.
- [x] Data Transfer (Spark → Ray): Validates ray.data.from_spark() using the updated ObjectStoreWriter.
- [x] Data Transfer (Ray → Spark): Validates ds.to_spark() using the new Spark411SQLHelper to reconstruct DataFrames from Arrow batches.
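The split-brain symptom above arises when the driver binds to localhost while executors advertise a LAN address. One common way to pick an explicit bind address (shown here as a sketch; the actual sanity test may do this differently) is the UDP-connect trick:

```python
import socket

def detect_bind_address() -> str:
    """Determine the host's outbound LAN IP without sending any packets:
    connecting a UDP socket merely selects a route and source address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))  # no traffic is actually sent
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no route available; fall back to loopback
    finally:
        s.close()
```

The resulting address can then be handed to Spark explicitly (for example via SPARK_LOCAL_IP) so the driver and executors agree on a single interface.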
6. Migration Guide
To upgrade to RayDP 2.0:
- Upgrade Java: Install OpenJDK 17 on your local machine and all cluster nodes.
- Set JAVA_HOME: Ensure JAVA_HOME points to the JDK 17 installation.
- Update Python Deps: pip install "pyspark>=4.1.1" "ray>=2.53.0" (quote the specifiers so the shell does not interpret >= as a redirect).
- Rebuild: Run mvn clean package in core/ and pip install . in python/.
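The steps above can be preflighted before launching a job. The parser below is a hypothetical helper that extracts the major release from typical java -version output; note that Java 8 reports itself under the legacy "1.8" scheme.

```python
import re

def java_major_version(version_output: str) -> int:
    """Extract the major Java release from `java -version` output.
    Handles the modern scheme ('17.0.9') and legacy ('1.8.0_392')."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not m:
        raise ValueError("unrecognized java -version output")
    major = int(m.group(1))
    if major == 1 and m.group(2):  # legacy '1.x' scheme, e.g. 1.8 -> 8
        major = int(m.group(2))
    return major

def check_java(version_output: str) -> None:
    """Fail fast with a clear message instead of a late JVM class error."""
    if java_major_version(version_output) < 17:
        raise RuntimeError("RayDP 2.0 requires Java 17 or newer")
```

Wiring this into cluster bootstrap turns a cryptic UnsupportedClassVersionError at task launch into an immediate, actionable message.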