Minimize the amount of version-specific classes [databricks] #14834

Closed
gerashegalov wants to merge 4 commits into NVIDIA:codex/unshim-stack-03-delta-iceberg from gerashegalov:codex/isolate-sql-plugin-shims

@gerashegalov gerashegalov commented May 20, 2026

Collaborator

Description

This PR continues isolating Spark-version-specific shim bytecode from the common
RAPIDS SQL plugin artifact so more classes can live in the conventional root jar
layout instead of the parallel-world layout.

The final approach keeps the existing Maven module layout. It does not add a new
module in this PR, and it does not use reflection to force bytecode identity.
Instead, packaging now treats bitwise-identical common classes as unshimmed by
default and keeps only explicit exceptions shimmed.

Review Stack

This branch has been reconstructed from the final diff against current
NVIDIA/main into four logical review layers. The old monolithic branch tip is
preserved at gerashegalov:codex/isolate-sql-plugin-shims-original.

Native gh-stack metadata could not be created because GitHub reported:

Stacked PRs are not enabled for this repository

The reviewable branch stack is:

  1. Packaging/default-unshim flow

    Branch: gerashegalov:codex/unshim-stack-01-packaging

    Diff: main...gerashegalov:spark-rapids:codex/unshim-stack-01-packaging

    Adds the default common-class root promotion flow, keep-list support,
    analyzer diagnostics, private build-info root-resource promotion, buildall fast-path support, and shim documentation.

  2. SQL plugin helper module reshaping

    Branch: gerashegalov:codex/unshim-stack-02-sql-plugin-modules

    Diff: gerashegalov/spark-rapids@codex/unshim-stack-01-packaging...codex/unshim-stack-02-sql-plugin-modules

    Moves Java-only helper surfaces into dedicated Java modules and updates
    SQL plugin, shuffle plugin, shims, tests, and Scala 2.13 build wiring.

  3. Delta/Iceberg adaptation

    Branch: gerashegalov:codex/unshim-stack-03-delta-iceberg

    Diff: gerashegalov/spark-rapids@codex/unshim-stack-02-sql-plugin-modules...codex/unshim-stack-03-delta-iceberg

    Adapts Delta Lake and Iceberg integration code to the shared helper layout.

  4. UDF/docs cleanup

    Branch: gerashegalov:codex/unshim-stack-04-udf-docs

    Diff: gerashegalov/spark-rapids@codex/unshim-stack-03-delta-iceberg...codex/unshim-stack-04-udf-docs

    Updates the UDF compiler and related documentation for the shared helper
    layout.

Ultimate Approach

The migration now uses these mechanisms:

  1. Promote bitwise-identical common classes by default.

    dist/scripts/binary-dedupe.sh already computes which files have a single
    checksum across the selected Spark shims. Common spark-shared class files
    from that proven-identical set are now promoted into the root jar layout by
    default.

    This inverts the old maintenance model. New common classes should normally
    remain unshimmed automatically as long as binary dedupe proves they are
    identical across shims.

  2. Keep a small explicit exclusion list.

    dist/keep-in-spark-shared.txt is the exception list for bitwise-identical
    common classes that still must remain in spark-shared for compatibility or
    packaging reasons. It is intentionally empty today.

  3. Keep resource and per-shim root lists narrow.

    dist/unshimmed-common-from-single-shim.txt now contains only root-layout
    resources that are not selected by default class promotion, currently
    META-INF files, the spark-rapids-private build-info resource, and Python worker files.

    dist/unshimmed-from-each-spark3xx.txt remains the mechanism for per-shim
    root artifacts. Those are not common spark-shared class files and are not
    replaced by the default common-class promotion.

  4. Retain verification and diagnostics.

    Binary dedupe still verifies that root-layout classes requiring shared
    identity are bitwise-identical across shims. The dependency analyzer now
    prints diagnostic output and writes root-safe-spark-shared.txt, but it is
    not the gate for default class promotion. The gate is binary identity.

  5. Normalize small shim implementations only when source changes are clean.

    When a class is semantically common but bytecode differs because Spark changed
    a helper signature, inherited trait shape, or constant value, the source is
    moved to common code only if the version-specific part can be expressed with
    stable public APIs or local VersionUtils checks.

    Recent examples include:

    • BridgeUnsafeProjectionCodegen: replaced Spark's
      CodeGeneratorWithInterpretedFallback dependency with local
      codegen-then-interpreted fallback logic.
    • GpuPythonFunction: uses Spark's stable PythonFuncExpression/sql
      rendering path instead of calling the version-sensitive toPrettySQL
      helper directly.
    • BloomFilterConstantsShims: common object with a runtime Spark-version
      predicate for the bloom filter format version.
    • ArrayInvalidArgumentErrorUtils: common trait that constructs the stable
      Spark runtime exception shape directly for the changed length-error API.
    • DecimalMultiply128: common helper with the existing Spark/DB version
      predicate selecting the correct JNI overload.
    • CastTimeToIntShim: common helper with a direct call from GpuCast; the
      previous reflective lookup was removed.
    • GetJsonObjectShim: common helper with a Spark 4.x predicate for the JSON
      path quoted-name regexp.
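
The runtime version predicate used by several of these examples can be illustrated with a small sketch. Everything here is hypothetical and for illustration only: `parse_version`, `format_version_for`, the two format constants, and the `(4, 0, 0)` cutoff are assumptions, not the plugin's actual VersionUtils API or the real bloom filter format values.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch[-suffix]' into a comparable tuple."""
    core = version.split("-", 1)[0]          # drop suffixes such as -SNAPSHOT
    parts = core.split(".")
    # Pad missing components with zeros so "4.0" compares like "4.0.0".
    major, minor, patch = (int(p) for p in (parts + ["0", "0"])[:3])
    return (major, minor, patch)


# Hypothetical values: the real constants live in the plugin sources.
LEGACY_FORMAT = 1
NEW_FORMAT = 2
NEW_FORMAT_SINCE = (4, 0, 0)  # assumed cutoff, for illustration only


def format_version_for(spark_version: str) -> int:
    """Pick a constant at runtime instead of compiling one class per shim."""
    if parse_version(spark_version) >= NEW_FORMAT_SINCE:
        return NEW_FORMAT
    return LEGACY_FORMAT
```

Because the choice is made at runtime, a single compiled class serves every Spark version, which is what lets binary dedupe see one checksum for it across shims.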

Classes remain in spark-shared or sparkXYZ when they require Spark APIs that
are absent in another supported Spark line, or when commonizing them would
require reflection or a broader bridge design.
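
The default promotion rule from mechanisms 1 and 2 can be sketched as follows. This is a simplified model, not the logic in dist/scripts/binary-dedupe.sh: `select_root_promotions` is a hypothetical function, the in-memory dictionary stands in for the per-shim class directories, and the sketch assumes a class must be present in every selected shim to be considered common.

```python
import hashlib
from typing import Dict, Iterable, List


def select_root_promotions(
    shim_classes: Dict[str, Dict[str, bytes]],
    keep_in_shared: Iterable[str] = (),
) -> List[str]:
    """Return class paths that can move to the root layout.

    shim_classes maps a shim name (e.g. "spark330") to {class path: bytecode}.
    keep_in_shared models the explicit exception list.
    """
    keep = set(keep_in_shared)
    per_shim = list(shim_classes.values())
    if not per_shim:
        return []

    # Assumption: only paths present in every selected shim are candidates.
    common = set(per_shim[0])
    for classes in per_shim[1:]:
        common &= set(classes)

    promoted = []
    for path in sorted(common):
        digests = {hashlib.sha1(classes[path]).hexdigest() for classes in per_shim}
        # Promote only when all shims agree bit for bit and no exception applies.
        if len(digests) == 1 and path not in keep:
            promoted.append(path)
    return promoted
```

With an empty exception list, every bitwise-identical common class is promoted, which matches the inverted maintenance model described above: new classes need no list entry unless they must stay shimmed.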

Quantification

Source-level shim code, defined as tracked .scala and .java files under production src/main/spark* source roots:

| Revision | Files | Lines |
|---|---|---|
| NVIDIA/main at b1eebc41a | 474 | 34,622 |
| This branch at 3e6e9fe95 | 458 | 34,224 |
| Delta | -16 | -398 |
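
A count like this could be reproduced roughly as shown below. This is a minimal sketch, not the script used to produce the numbers above: `count_shim_sources` is a hypothetical helper, and it walks the filesystem instead of asking git for tracked files, so untracked or generated sources would have to be excluded separately.

```python
from pathlib import Path
from typing import Tuple


def count_shim_sources(repo_root, tree: str = "main") -> Tuple[int, int]:
    """Count .scala/.java files and their lines under src/<tree>/spark* roots."""
    root = Path(repo_root)
    files = 0
    lines = 0
    for path in root.rglob("*"):
        if path.suffix not in (".scala", ".java") or not path.is_file():
            continue
        # Only directory components decide whether this is a shim source root.
        dirs = path.relative_to(root).parts[:-1]
        for i in range(len(dirs) - 2):
            if (dirs[i] == "src" and dirs[i + 1] == tree
                    and dirs[i + 2].startswith("spark")):
                files += 1
                with path.open("rb") as handle:
                    lines += sum(1 for _ in handle)
                break
    return files, lines
```

Calling `count_shim_sources(".", tree="test")` would give the corresponding figure for test shim sources under the same assumptions.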

src/test/spark* shim source is unchanged at 149 files and 16,238 lines.
The larger impact of this PR is binary placement: most classes that still come
from shim-built artifacts no longer stay in the shim classloader layout when
binary dedupe proves they are identical.

For the validated 330,358 fast parallel-world assembly:

| Binary location | Class entries | Notes |
|---|---|---|
| root layout | 4,557 | unshimmed classes visible to the root loader |
| spark-shared | 0 | default promotion emptied the shared shim bucket for this build |
| sparkXYZ | 438 | 219 unique class paths, one copy under each of spark330 and spark358 |

The same assembly selected 4,557 class files and 4 non-class resources for root
promotion. Local noSnapshots builds on this branch validate the full OSS shim matrix for both Scala lines, excluding Databricks shims which require proprietary Databricks build images:

./build/buildall --clean --profile=noSnapshots --parallel=8 --option=-Ddist.jar.compress=false
./build/buildall --clean --scala213 --profile=noSnapshots --parallel=8 --option=-Ddist.jar.compress=false

| Local noSnapshots build | OSS Spark shims | root-promoted common classes | final root class entries | final spark-shared class entries | final sparkXYZ class entries | unique sparkXYZ class paths |
|---|---|---|---|---|---|---|
| Scala 2.12 | 19 | 4,649 | 5,159 | 0 | 4,268 | 230 |
| Scala 2.13 | 13 | 4,539 | 5,052 | 0 | 2,895 | 238 |

The root class-entry counts include copied JNI/UCX dependency classes. The
root-promoted common classes column is the plugin-side class promotion count
from default-unshimmed-spark-shared.txt.

For comparison, current OSS nightly snapshot jars still carry most
classes in shim-loader layout:

| OSS nightly jar | root class entries | spark-shared entries | sparkXYZ entries | shim-layout entries | unique shim-layout class paths |
|---|---|---|---|---|---|
| Scala 2.12 26.08.0-20260605.172952-9-cuda12 | 658 | 4,390 | 13,178 | 17,568 | 5,089 |
| Scala 2.13 26.08.0-20260605.164124-9-cuda12 | 649 | 4,305 | 9,853 | 14,158 | 5,038 |

These nightly counts are jar-entry counts. sparkXYZ counts include one copy
per Spark shim; the unique class-path column deduplicates those copies by
removing the leading sparkXYZ/ or spark-shared/ prefix.
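
The counting scheme described here can be sketched in a few lines. This is an illustration under assumptions, not the tool used for the tables: `summarize_class_entries` is a hypothetical helper, and the regular expression assumes shim directories are named `spark` followed by digits and an optional suffix.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumed naming: spark330/, spark358/, and similar top-level shim directories.
_SHIM_DIR = re.compile(r"^spark\d+[^/]*/")
_SHARED = "spark-shared/"


def summarize_class_entries(entry_names: Iterable[str]) -> Dict[str, int]:
    """Bucket .class jar entries by layout and dedupe shim-layout class paths."""
    counts = Counter()
    unique_shim_paths = set()
    for name in entry_names:
        if not name.endswith(".class"):
            continue  # resources and directory entries are not counted
        if name.startswith(_SHARED):
            counts["spark-shared"] += 1
            unique_shim_paths.add(name[len(_SHARED):])
            continue
        match = _SHIM_DIR.match(name)
        if match:
            counts["sparkXYZ"] += 1
            unique_shim_paths.add(name[match.end():])  # strip the sparkXYZ/ prefix
        else:
            counts["root"] += 1
    return {
        "root": counts["root"],
        "spark-shared": counts["spark-shared"],
        "sparkXYZ": counts["sparkXYZ"],
        "shim-layout": counts["spark-shared"] + counts["sparkXYZ"],
        "unique shim-layout class paths": len(unique_shim_paths),
    }
```

The entry names for a real jar could come from `zipfile.ZipFile(path).namelist()` in the Python standard library.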

Build-Time Iteration Support

For repeated unshim analysis, build/buildall has a cheaper fast path:

./build/buildall --unshim-fast --parallel-world-only --profile=353,358 --parallel=2

In fast mode the build skips expensive Maven work, disables dist jar compression,
and ignores stale shim revision metadata caused by rapids.build.info.skip=true.
The revision mismatch is still printed for visibility.

Validation

Focused ScalaTests were rerun after fixing the CI-reported serialization failure:

  • MortgageSparkSuite and MortgageAdaptiveSparkSuite: 12 tests passed.
  • HashAggregateRetrySuite: 7 tests passed.

The serialization fix makes GpuHashAggregateMetrics serializable so
GpuHashAggregateExec.internalDoExecuteColumnar no longer captures a
non-serializable metrics holder in the Spark task closure.

Packaging validation for the inverted default-unshim logic:

./build/buildall --unshim-fast --parallel-world-only --profile=353,358 --parallel=2

Regression validation for the private build-info resource reported missing by CI:

./build/buildall --unshim-fast --parallel-world-only --profile=330,358 --parallel=2

Result:

  • Default root promotion completed with no classes missing required shared bitwise identity.
  • The 330,358 fast assembly produced dist/target/parallel-world/rapids4spark-private-version-info.properties at root, so root-loaded RapidsPluginUtils can read the private dependency build metadata.
  • The conservative dependency analyzer continues to write root-safe-spark-shared.txt diagnostics.

Additional local checks:

  • bash -n dist/scripts/binary-dedupe.sh
  • bash -O extglob -n build/buildall
  • python3 -m py_compile dist/scripts/build-unshim-parallel-world.py dist/scripts/analyze-parallel-world-deps.py
  • git diff --check

Follow-Up Direction

The separate Java-only Maven module idea remains a follow-up cleanup path, not
this PR's primary mechanism. Good follow-up pilots are still:

  1. a Java-only format module for com/nvidia/spark/rapids/format/*.java,
  2. a Java-only file I/O module once dependencies are clean, and
  3. vector/columnar Java classes only after dependencies such as GpuTypeShims
    are untangled.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@gerashegalov

Collaborator Author

build

@nvauto

nvauto commented May 25, 2026

Collaborator

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

Comment thread scala2.13/sql-plugin-shims/pom.xml Outdated
<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-shim-deps-parent_2.13</artifactId>
<version>26.06.0-SNAPSHOT</version>

Member

Since this is still in draft, just a reminder:
If this one targets main, please update all the versions to 26.08. Thanks.

@gerashegalov gerashegalov changed the title Codex/isolate sql plugin shims Minimize the amount of version-specific classes [databricks] Jun 6, 2026
@gerashegalov

Collaborator Author

build

@gerashegalov gerashegalov force-pushed the codex/isolate-sql-plugin-shims branch 2 times, most recently from c4dd209 to ca09fdc Compare June 7, 2026 14:24
@gerashegalov

Collaborator Author

build

@gerashegalov gerashegalov force-pushed the codex/isolate-sql-plugin-shims branch 2 times, most recently from 3e6e9fe to 3cf1779 Compare June 7, 2026 19:32
@gerashegalov gerashegalov changed the base branch from main to codex/unshim-stack-03-delta-iceberg June 10, 2026 14:09
@gerashegalov

gerashegalov commented Jun 10, 2026

Collaborator Author

Official GitHub stacked PRs are now enabled, but GitHub stacks cannot include fork-head PRs. I created the stack from upstream branch refs instead:

  1. #15025 - Add default common unshim packaging flow
  2. #15026 (Add SQL plugin helper module wiring) - Reshape SQL plugin into Java helper modules
  3. #15027 - Adapt Delta and Iceberg to unshimmed helpers
  4. #15028 - Update UDF compiler for shared helper layout

This PR is now the old fork-head record for the same top branch tip and should no longer be the primary review target.

Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-03-delta-iceberg branch 26 times, most recently from 2db9df3 to e557937 Compare June 13, 2026 12:20
@gerashegalov

Collaborator Author

This big PR was split into Stack #15054
