Add default common unshim packaging flow by gerashegalov · Pull Request #15025 · NVIDIA/cudf-spark

gerashegalov · 2026-06-10T14:09:09Z

Related to #14834.

Description

This is the bottom layer and introduction for the unshim stack. The overall goal of the stack is to stop treating byte-for-byte identical common classes as shimmed artifacts just because they are emitted through the parallel-world build. Instead, common classes should land in the normal root jar layout by default, and only classes that truly need spark-shared packaging should remain there explicitly.

This PR adds that packaging policy and the tooling needed to make it auditable. The follow-up PRs then move code in themed batches so each review can focus on one class family or one caller migration at a time.

Why invert the model

Before this stack, the packaging flow was driven by explicit lists of classes that had been proven safe to unshim. That made every new common class suspicious by default: unless it was discovered, tested, and added to the unshim list, it stayed in spark-shared even when all shim builds produced identical bytecode.

After this PR, bitwise-identical common classes are promoted to the root layout by default. The exceptional case is now explicit: dist/keep-in-spark-shared.txt records patterns for classes that are identical but still must remain under spark-shared for compatibility or packaging reasons. New common code should stay unshimmed unless there is a concrete reason not to.
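The inverted default can be illustrated with a minimal sketch. This is not the logic in binary-dedupe.sh; the function name `plan_promotion`, the in-memory input shape, the shim names, and the use of `fnmatch` globs for the keep patterns are all assumptions made for illustration. It also assumes a class must be present in every shim to count as identical.

```python
import fnmatch
import hashlib
from collections import defaultdict


def plan_promotion(shim_classes, keep_patterns):
    """Split class paths into (promote_to_root, not_promoted).

    shim_classes: dict of shim name -> {class path: class-file bytes}
    keep_patterns: glob patterns for classes that must stay in spark-shared
                   (hypothetical stand-in for dist/keep-in-spark-shared.txt)
    """
    # Record one digest per (class path, shim) pair.
    digests = defaultdict(dict)
    for shim, classes in shim_classes.items():
        for path, data in classes.items():
            digests[path][shim] = hashlib.sha1(data).hexdigest()

    promote, not_promoted = [], []
    for path in sorted(digests):
        by_shim = digests[path]
        # Assumption: a class missing from any shim is never promoted.
        in_every_shim = len(by_shim) == len(shim_classes)
        identical = len(set(by_shim.values())) == 1
        kept = any(fnmatch.fnmatchcase(path, p) for p in keep_patterns)
        if in_every_shim and identical and not kept:
            promote.append(path)       # default: identical classes go to root
        else:
            not_promoted.append(path)  # differs across shims, or explicitly kept
    return promote, not_promoted


# Hypothetical two-shim input with made-up class paths.
example_shims = {
    "spark330": {"a/A.class": b"x", "a/B.class": b"y", "a/C.class": b"z"},
    "spark350": {"a/A.class": b"x", "a/B.class": b"y2", "a/C.class": b"z"},
}
promote, not_promoted = plan_promotion(example_shims, ["a/C*"])
```

The point of the sketch is the direction of the default: nothing has to be listed for a class to be promoted, and only the keep patterns need maintenance.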

What changes in this PR

Adds the default-unshim packaging path used by dist/build/package-parallel-worlds.py and the parallel-world antrun packaging.
Adds dist/keep-in-spark-shared.txt as the explicit escape hatch for classes that must remain in spark-shared.
Adds dist/scripts/build-unshim-parallel-world.py to build/package a single-shim view cheaply while iterating on unshim candidates.
Adds dist/scripts/analyze-parallel-world-deps.py to inspect class-file dependencies and report remaining static paths from root/common classes to version-specific shim bytecode.
Extends dist/scripts/binary-dedupe.sh diagnostics so the packaging result can be explained from class-file evidence instead of by inspection of jar contents alone.
Optimizes build/buildall for repeated unshim iteration, including cheaper paths and -Ddist.jar.compress=false on the fast path.
Updates docs/dev/shims.md, docs/dev/shimplify.md, and dist/README.md to document the new model.
Preserves root resource promotion for build/version metadata files that must be visible from the final plugin jar root, including private dependency version-info resources.

Reviewer guidance

The important policy decision is in this PR: the default changes from "stay shimmed until listed" to "unshim when all emitted common bytecode is identical, unless excluded." Please review the packaging rules, the exclusion file semantics, and the analyzer/dedupe tooling with that in mind.

The large class movement is intentionally not in this PR. The rest of the stack applies the new model in smaller layers so reviewers can inspect each group without reading the entire migration at once.

Stack map

Add default common unshim packaging flow #15025: packaging policy, analyzer tooling, binary-dedupe diagnostics, buildall fast path, and docs.
Add SQL plugin helper module wiring #15026: Maven/module wiring for Java-friendly SQL plugin helper modules.
Move shared API and shuffle format helpers to Java modules #15040 and Move Hadoop file I/O helpers to Java module #15030: shared API/format helpers and Hadoop file I/O helpers.
Add Spark 3.3.0 SQL shim module sources #15043, Add Spark 3.3.1 through 3.4 SQL shim module sources #15041, and Add Spark 3.5 and 4.0 SQL shim module sources #15042: Spark-version SQL shim module source population.
Move column vector and host memory helpers to the columnar module #15031, Move columnar scalar utilities to Java modules #15044, Move columnar runtime config helpers to Java #15032, Move columnar table value helpers to Java modules #15045, Move profile and write stats to the columnar module #15046, and Move aggregate and shuffle stats to the columnar module #15047: columnar/vector/runtime/table/stat helper moves.
Adapt rule and plugin metadata callers to Java helpers #15048, Adapt Kudo and file scan callers to Java helpers #15049, Adapt shuffle and spill callers to Java helpers #15050, Adapt core execution callers to columnar helpers #15051, Adapt SQL RAPIDS callers to columnar helpers #15033, Adapt remaining SQL plugin callers to Java helpers #15034, and Move common shim helpers to shared sources #15052: caller migration and common shim helper moves.
Remove old Spark 3.3.0 shim sources #15035, Remove old Spark 3.3 DBR shim sources #15053, Remove old Spark 3.3.1 through 3.4 shim sources #15036, Remove old Spark 3.5 shim sources #15037, and Remove old Spark 4 shim sources and update tests #15038: old shim source cleanup, with the DBR cleanup isolated in Remove old Spark 3.3 DBR shim sources #15053.
Adapt Delta and Iceberg to unshimmed helpers #15027: Delta and Iceberg follow-up adaptation.
Update UDF compiler for shared helper layout #15028: UDF compiler and documentation follow-up adaptation.

Testing and validation notes

The split stack was verified to be tree-equivalent to the pre-split stack top; the split of the old large Add SQL plugin helper module wiring #15026 is also tree-equivalent to the previous Add SQL plugin helper module wiring #15026 branch top.
Local noSnapshots packaging checks during development showed zero final spark-shared/*.class entries for both Scala 2.12 and Scala 2.13 OSS shim sets.
Databricks shims are not locally buildable, so DBR-specific source cleanup is isolated in Remove old Spark 3.3 DBR shim sources #15053 for Databricks-node validation.

Checklists

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
(Covered by packaging/noSnapshots validation and the existing parallel-world packaging checks described above.)
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

greptile-apps · 2026-06-13T12:57:57Z

Greptile Summary

This PR inverts the parallel-world packaging policy: bitwise-identical common classes are now promoted to the root jar layout by default, rather than requiring explicit allowlist entries. The keep-in-spark-shared.txt escape hatch holds the rare classes that must remain in spark-shared even when identical across shims.

Policy change in binary-dedupe.sh: replaces the old explicit-allowlist model with default promotion of all SHA-identical .class files; adds dedupe caching, a single-shim fast path, and keep-in-spark-shared.txt filtering.
New tooling: analyze-parallel-world-deps.py (JVM class-file dependency graph and SCC analysis) and build-unshim-parallel-world.py (Maven-free parallel-world assembly for rapid iteration), both wired into buildall via four new --unshim-* flags.
Allowlist cleanup: unshimmed-common-from-single-shim.txt is trimmed from ~50 class entries to a handful of non-class root resources; the removed classes are now promoted automatically.

Confidence Score: 4/5

Safe to merge; all changes are build/packaging tooling with no runtime bytecode impact.

The packaging logic inversion is well-reasoned and the escape-hatch file prevents regressions. The recursive Tarjan SCC in the analyzer can segfault when processing a large dist jar with many class nodes, which would break the analysis workflow for developers. The fast-path script picks a different representative shim (highest vs. lowest) than the Maven build path for from_single_shim resources, which is a latent inconsistency. The mtime-based jar cache can serve stale content silently after a cp -p update. These are all isolated to developer tooling; production artifact generation goes through the Maven path which is unchanged in behavior.

dist/scripts/analyze-parallel-world-deps.py (recursive SCC), dist/scripts/build-unshim-parallel-world.py (root_buildver selection and mtime cache), dist/scripts/binary-dedupe.sh (keep_in_spark_shared per-call file reads)

Important Files Changed

Filename	Overview
dist/scripts/analyze-parallel-world-deps.py	New 617-line tool: parses JVM class files from the dist jar/parallel-world directory, builds a dependency graph, and reports which spark-shared classes can be safely promoted to the root layout. The recursive Tarjan SCC implementation may segfault on large class graphs.
dist/scripts/build-unshim-parallel-world.py	New 292-line fast-path script for assembling the parallel-world directory without a full Maven dist invocation. Includes zip-slip protection in safe_extract. The root_buildver for from_single_shim sourcing uses the highest buildver, differing from the Maven-driven path which uses the first/lowest.
dist/scripts/binary-dedupe.sh	Refactored to promote ALL bitwise-identical spark-shared classes to the root layout by default, replacing the old explicit allowlist model. Adds dedupe caching, single-shim fast path, and the keep-in-spark-shared exclusion filter. The keep_in_spark_shared function re-reads the patterns file per class call.
build/buildall	Adds four new flags (--unshim-fast, --parallel-world-only, --unshim-reuse-built-jars, --unshim-allowlist-only) to speed up repeated unshim iteration. Logic gates are correctly checked; dependency validation between flags is enforced at startup.
dist/build/package-parallel-worlds.py	Moves file-list reads outside the per-buildver loop and adds an UNSHIM_FAST optimization branch in select_matching_members; no logic changes to the non-fast path.
dist/maven-antrun/build-parallel-worlds.xml	Threads two new env-var inputs (KEEP_IN_SPARK_SHARED_TXT, UNSHIM_ANALYZER_SCRIPT) to binary-dedupe.sh and makes the remove-dependencies-from-pom target skippable via a new Maven property.
dist/unshimmed-common-from-single-shim.txt	Drastically slimmed: all explicit class entries removed (they are now promoted automatically by binary-dedupe), retaining only non-class root resources (META-INF files, private version-info properties, Python worker files).
dist/keep-in-spark-shared.txt	New escape-hatch file; currently empty. Patterns added here prevent otherwise-promotable bitwise-identical classes from being moved to the root layout.
docs/dev/shims.md	Adds ~80 lines of new documentation covering the one-way shim module boundary, the default-unshim promotion model, and usage examples for the dependency analyzer.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Per-shim Maven builds\nsql-plugin-api + aggregator jars] --> B[parallel-world assembly\nbuild-parallel-worlds.xml / build-unshim-parallel-world.py]
    B --> C[binary-dedupe.sh]
    C --> D{SHA-identical\nacross all shims?}
    D -- Yes --> E{In keep-in-spark-shared.txt?}
    D -- No --> F[spark-shared/\nversion-specific copy]
    E -- Yes --> F
    E -- No --> G[Default promotion\nto root layout]
    C --> H[analyze-parallel-world-deps.py\nwrite ROOT_SAFE_SPARK_SHARED_TXT]
    H -.->|diagnostic only| G
    G --> I[Final dist jar\nroot-layout classes]
    F --> J[Final dist jar\nspark-shared/ classes]
    K[unshimmed-common-from-single-shim.txt\nMETA-INF, .py, version-info] --> L[Explicit root promotion\nnon-class resources]
    L --> I
    M[unshimmed-from-each-spark3xx.txt\nper-shim root resources] --> I

Reviews (1): Last reviewed commit: "Add default common unshim packaging flow"

greptile-apps · 2026-06-13T12:58:01Z

+    queue = collections.deque([(start, [start])])
+    seen = {start}
+    while queue:
+        node, path = queue.popleft()
+        if node != start and is_version_node(node):
+            return path
+        for next_node in sorted(graph[node]):
+            if next_node not in seen:
+                seen.add(next_node)
+                queue.append((next_node, path + [next_node]))
+    return None
+
+
+def tarjan_scc(graph):
+    sys.setrecursionlimit(max(sys.getrecursionlimit(), len(graph) * 2 + 1000))
+
+    index = 0
+    stack = []
+    on_stack = set()
+    indexes = {}
+    lowlinks = {}
+    components = []
+
+    def strongconnect(node):
+        nonlocal index
+        indexes[node] = index
+        lowlinks[node] = index
+        index += 1
+        stack.append(node)
+        on_stack.add(node)
+
+        for next_node in graph[node]:
+            if next_node not in indexes:
+                strongconnect(next_node)
+                lowlinks[node] = min(lowlinks[node], lowlinks[next_node])
+            elif next_node in on_stack:
+                lowlinks[node] = min(lowlinks[node], indexes[next_node])
+
+        if lowlinks[node] == indexes[node]:
+            component = []
+            while True:
+                item = stack.pop()
+                on_stack.remove(item)
+                component.append(item)
+                if item == node:
+                    break
+            components.append(component)
+
+    for node in graph:
+        if node not in indexes:
+            strongconnect(node)
+    return components


Recursive Tarjan SCC may segfault on large class graphs

tarjan_scc uses sys.setrecursionlimit to raise the interpreter's recursion limit to at least len(graph) * 2 + 1000, but that only lifts Python's own guard; the native stack keeps its OS default, typically 8 MB for the main thread on Linux. A dist jar with 10,000+ class nodes (not unusual) would set the limit to 21,000; at ~500–1,000 bytes of stack per Python frame, recursion approaching that depth can exceed 8 MB and cause an OS-level stack overflow (a segfault, not a catchable RecursionError). The standard fix is to convert strongconnect to an iterative DFS with an explicit stack.
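An iterative variant along the lines suggested here might look like the following. It is a sketch rather than the PR's code: the name `tarjan_scc_iterative` is hypothetical, and it assumes `graph` is a mapping from node to an iterable of successors. Unlike the quoted hunk, it uses `graph.get(...)` so that nodes appearing only as successors do not raise a KeyError with a plain dict.

```python
def tarjan_scc_iterative(graph):
    """Return strongly connected components without Python recursion.

    graph: mapping of node -> iterable of successor nodes.
    """
    index = 0
    stack = []          # Tarjan's node stack
    on_stack = set()
    indexes = {}
    lowlinks = {}
    components = []

    for root in graph:
        if root in indexes:
            continue
        indexes[root] = lowlinks[root] = index
        index += 1
        stack.append(root)
        on_stack.add(root)
        # Explicit DFS stack of (node, iterator over its successors).
        work = [(root, iter(graph.get(root, ())))]
        while work:
            node, successors = work[-1]
            descended = False
            for next_node in successors:
                if next_node not in indexes:
                    # "Recurse" by pushing the unvisited successor.
                    indexes[next_node] = lowlinks[next_node] = index
                    index += 1
                    stack.append(next_node)
                    on_stack.add(next_node)
                    work.append((next_node, iter(graph.get(next_node, ()))))
                    descended = True
                    break
                if next_node in on_stack:
                    lowlinks[node] = min(lowlinks[node], indexes[next_node])
            if descended:
                continue
            # All successors handled: this is the "return" from recursion.
            work.pop()
            if work:
                parent = work[-1][0]
                lowlinks[parent] = min(lowlinks[parent], lowlinks[node])
            if lowlinks[node] == indexes[node]:
                component = []
                while True:
                    item = stack.pop()
                    on_stack.remove(item)
                    component.append(item)
                    if item == node:
                        break
                components.append(component)
    return components
```

Because the DFS state lives in the `work` list on the heap, the depth is limited by available memory rather than by the native stack, and no change to the recursion limit is needed.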

greptile-apps · 2026-06-13T12:58:02Z

+    sorted_buildvers = sorted(buildvers, reverse=True)
+    root_buildver = sorted_buildvers[0]


root_buildver selection is inconsistent with package-parallel-worlds.py

sorted(buildvers, reverse=True)[0] picks the highest buildver to source from_single_shim resources (e.g., rapids4spark-private-version-info.properties, Python scripts). The Maven-driven script (package-parallel-worlds.py) uses buildver_list[0], which reflects the Maven property order and is historically the lowest version. When both paths are used for the same branch the two build modes may produce output assembled from different representative shims, which could silently diverge if any from_single_shim resource ever differs between the lowest and highest shim jars.
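One way to keep the two paths aligned is to route both through a single selection helper. The sketch below is illustrative only: `representative_buildver` is a hypothetical name, the buildver values are made up, and it assumes both build paths receive the list in the same declared order.

```python
def representative_buildver(buildvers):
    """Return the shim used as the source for from_single_shim resources.

    Assumption: callers pass buildvers in the declared (Maven property)
    order, so the first element is the same on either build path.
    """
    if not buildvers:
        raise ValueError("at least one buildver is required")
    return buildvers[0]


# Hypothetical buildver list in declared order.
buildvers = ["330", "341", "350", "400"]
declared_first = representative_buildver(buildvers)
# What the fast path in the quoted hunk would select instead.
highest_sorted = sorted(buildvers, reverse=True)[0]
```

With this sample list the two selections differ, which is the divergence the comment describes.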

greptile-apps · 2026-06-13T12:58:03Z

+def jar_signature(jar_path):
+    stat = jar_path.stat()
+    return "\n".join((
+        "path=%s" % jar_path,
+        "size=%s" % stat.st_size,
+        "mtime_ns=%s" % stat.st_mtime_ns,
+        "",
+    ))
+
+
+def dedupe_cache_key(base_dir, scala_binary_version, project_version, buildvers):
+    parts = []
+    for buildver in sorted(buildvers, reverse=True):
+        for artifact in ARTIFACTS:
+            jar_path = artifact_jar(
+                base_dir, artifact, scala_binary_version, project_version, buildver)


mtime-based cache key can produce stale analysis output silently

jar_signature keys the extraction cache on the jar path, st_size, and st_mtime_ns. In buildall, refresh_fast_aggregator_jar uses cp -p, which preserves the source mtime. If the aggregator jar is rebuilt with different bytecode and the copy ends up with the same size and preserved timestamp as an existing cache entry, the cached extraction is returned without invalidation, and the analyzer sees stale class content. Using a content hash (e.g., SHA-1 of the jar) instead of mtime would make the cache reliably reflect actual content.
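A content-based signature could be sketched as below. The function name `jar_content_signature` and the chunk size are assumptions, and SHA-256 is swapped in here for the SHA-1 mentioned above; either works for change detection. The returned string mirrors the newline-terminated key/value shape of the quoted jar_signature.

```python
import hashlib
from pathlib import Path


def jar_content_signature(jar_path, chunk_size=1024 * 1024):
    """Return a cache-key fragment derived from the jar's bytes.

    Reads the file in chunks so large jars are not loaded into memory at once.
    """
    digest = hashlib.sha256()
    with Path(jar_path).open("rb") as jar_file:
        for chunk in iter(lambda: jar_file.read(chunk_size), b""):
            digest.update(chunk)
    return "path=%s\nsha256=%s\n" % (jar_path, digest.hexdigest())
```

The trade-off is that every cache lookup reads each jar in full, so this costs more I/O than a stat call; whether that matters depends on jar sizes and how often the fast path runs.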

greptile-apps · 2026-06-13T12:58:04Z

+function keep_in_spark_shared() {
+  set -e
+  local class_file="$1"
+  local keep_patterns_txt="${KEEP_IN_SPARK_SHARED_TXT:-}"
+  [[ -n "$keep_patterns_txt" ]] || return 1
+  [[ -f "$keep_patterns_txt" ]] || {
+    echo >&2 "Keep-in-spark-shared list does not exist: $keep_patterns_txt"
+    exit 255
+  }
+
+  local pattern
+  while IFS= read -r pattern; do
+    [[ -n "$pattern" ]] || continue
+    [[ "$pattern" =~ ^[[:space:]]*# ]] && continue
+    # shellcheck disable=SC2053
+    if [[ "$class_file" == $pattern ]]; then
+      return 0
+    fi
+  done < "$keep_patterns_txt"
+  return 1
+}


keep_in_spark_shared re-reads the patterns file for every class

filter_keep_in_spark_shared calls keep_in_spark_shared once per class entry, and each call opens and re-reads $KEEP_IN_SPARK_SHARED_TXT. Currently the file is empty so there is no impact, but once the exclusion list grows this becomes O(N × M) work: for N classes and M patterns, every class triggers a full read of the file. Caching the patterns in an array before the loop (the same way unshimmed_class_needs_shared_identity uses an inline [[ ]] list) would avoid the repeated reads.


gerashegalov · 2026-06-13T13:01:22Z

build

gerashegalov mentioned this pull request Jun 10, 2026

Update UDF compiler for shared helper layout #15028

Open

8 tasks

gerashegalov changed the title ~~codex/unshim stack 01 packaging~~ Add default common unshim packaging flow Jun 10, 2026

This was referenced Jun 10, 2026

Adapt Delta and Iceberg to unshimmed helpers #15027

Open

Add SQL plugin helper module wiring #15026

Open

Minimize the amount of version-specific classes [databricks] #14834

Closed

gerashegalov force-pushed the codex/unshim-stack-01-packaging branch from 6d223b7 to a0d7661 Compare June 10, 2026 15:08

gerashegalov self-assigned this Jun 11, 2026

Add default common unshim packaging flow

d303bb3

gerashegalov force-pushed the codex/unshim-stack-01-packaging branch from a0d7661 to d303bb3 Compare June 13, 2026 12:13

gerashegalov marked this pull request as ready for review June 13, 2026 12:48

gerashegalov requested a review from a team as a code owner June 13, 2026 12:48

greptile-apps Bot reviewed Jun 13, 2026

gerashegalov wants to merge 1 commit into main from codex/unshim-stack-01-packaging
