Skip to content

Add default common unshim packaging flow#15025

Open
gerashegalov wants to merge 1 commit into
mainfrom
codex/unshim-stack-01-packaging
Open

Add default common unshim packaging flow#15025
gerashegalov wants to merge 1 commit into
mainfrom
codex/unshim-stack-01-packaging

Conversation

@gerashegalov

@gerashegalov gerashegalov commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Related to #14834.

Description

This is the bottom layer and introduction for the unshim stack. The overall goal of the stack is to stop treating byte-for-byte identical common classes as shimmed artifacts just because they are emitted through the parallel-world build. Instead, common classes should land in the normal root jar layout by default, and only classes that truly need spark-shared packaging should remain there explicitly.

This PR adds that packaging policy and the tooling needed to make it auditable. The follow-up PRs then move code in themed batches so each review can focus on one class family or one caller migration at a time.

Why invert the model

Before this stack, the packaging flow was driven by explicit lists of classes that had been proven safe to unshim. That made every new common class suspicious by default: unless it was discovered, tested, and added to the unshim list, it stayed in spark-shared even when all shim builds produced identical bytecode.

After this PR, bitwise-identical common classes are promoted to the root layout by default. The exceptional case is now explicit: dist/keep-in-spark-shared.txt records patterns for classes that are identical but still must remain under spark-shared for compatibility or packaging reasons. New common code should stay unshimmed unless there is a concrete reason not to.

What changes in this PR

  • Adds the default-unshim packaging path used by dist/build/package-parallel-worlds.py and the parallel-world antrun packaging.
  • Adds dist/keep-in-spark-shared.txt as the explicit escape hatch for classes that must remain in spark-shared.
  • Adds dist/scripts/build-unshim-parallel-world.py to build/package a single-shim view cheaply while iterating on unshim candidates.
  • Adds dist/scripts/analyze-parallel-world-deps.py to inspect class-file dependencies and report remaining static paths from root/common classes to version-specific shim bytecode.
  • Extends dist/scripts/binary-dedupe.sh diagnostics so the packaging result can be explained from class-file evidence instead of by inspection of jar contents alone.
  • Optimizes build/buildall for repeated unshim iteration, including cheaper paths and -Ddist.jar.compress=false on the fast path.
  • Updates docs/dev/shims.md, docs/dev/shimplify.md, and dist/README.md to document the new model.
  • Preserves root resource promotion for build/version metadata files that must be visible from the final plugin jar root, including private dependency version-info resources.

Reviewer guidance

The important policy decision is in this PR: the default changes from "stay shimmed until listed" to "unshim when all emitted common bytecode is identical, unless excluded." Please review the packaging rules, the exclusion file semantics, and the analyzer/dedupe tooling with that in mind.

The large class movement is intentionally not in this PR. The rest of the stack applies the new model in smaller layers so reviewers can inspect each group without reading the entire migration at once.

Stack map

Testing and validation notes

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Covered by packaging/noSnapshots validation and the existing parallel-world packaging checks described above.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@gerashegalov gerashegalov changed the title codex/unshim stack 01 packaging Add default common unshim packaging flow Jun 10, 2026
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-01-packaging branch from 6d223b7 to a0d7661 Compare June 10, 2026 15:08
@gerashegalov gerashegalov self-assigned this Jun 11, 2026
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-01-packaging branch from a0d7661 to d303bb3 Compare June 13, 2026 12:13
@gerashegalov gerashegalov marked this pull request as ready for review June 13, 2026 12:48
@gerashegalov gerashegalov requested a review from a team as a code owner June 13, 2026 12:48
@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR inverts the parallel-world packaging policy: bitwise-identical common classes are now promoted to the root jar layout by default, rather than requiring explicit allowlist entries. The keep-in-spark-shared.txt escape hatch holds the rare classes that must remain in spark-shared even when identical across shims.

  • Policy change in binary-dedupe.sh: replaces the old explicit-allowlist model with default promotion of all SHA-identical .class files; adds dedupe caching, a single-shim fast path, and keep-in-spark-shared.txt filtering.
  • New tooling: analyze-parallel-world-deps.py (JVM class-file dependency graph and SCC analysis) and build-unshim-parallel-world.py (Maven-free parallel-world assembly for rapid iteration), both wired into buildall via four new --unshim-* flags.
  • Allowlist cleanup: unshimmed-common-from-single-shim.txt is trimmed from ~50 class entries to a handful of non-class root resources; the removed classes are now promoted automatically.

Confidence Score: 4/5

Safe to merge; all changes are build/packaging tooling with no runtime bytecode impact.

The packaging logic inversion is well-reasoned and the escape-hatch file prevents regressions. The recursive Tarjan SCC in the analyzer can segfault when processing a large dist jar with many class nodes, which would break the analysis workflow for developers. The fast-path script picks a different representative shim (highest vs. lowest) than the Maven build path for from_single_shim resources, which is a latent inconsistency. The mtime-based jar cache can serve stale content silently after a cp -p update. These are all isolated to developer tooling; production artifact generation goes through the Maven path which is unchanged in behavior.

dist/scripts/analyze-parallel-world-deps.py (recursive SCC), dist/scripts/build-unshim-parallel-world.py (root_buildver selection and mtime cache), dist/scripts/binary-dedupe.sh (keep_in_spark_shared per-call file reads)

Important Files Changed

Filename Overview
dist/scripts/analyze-parallel-world-deps.py New 617-line tool: parses JVM class files from the dist jar/parallel-world directory, builds a dependency graph, and reports which spark-shared classes can be safely promoted to the root layout. The recursive Tarjan SCC implementation may segfault on large class graphs.
dist/scripts/build-unshim-parallel-world.py New 292-line fast-path script for assembling the parallel-world directory without a full Maven dist invocation. Includes zip-slip protection in safe_extract. The root_buildver for from_single_shim sourcing uses the highest buildver, differing from the Maven-driven path which uses the first/lowest.
dist/scripts/binary-dedupe.sh Refactored to promote ALL bitwise-identical spark-shared classes to the root layout by default, replacing the old explicit allowlist model. Adds dedupe caching, single-shim fast path, and the keep-in-spark-shared exclusion filter. The keep_in_spark_shared function re-reads the patterns file per class call.
build/buildall Adds four new flags (--unshim-fast, --parallel-world-only, --unshim-reuse-built-jars, --unshim-allowlist-only) to speed up repeated unshim iteration. Logic gates are correctly checked; dependency validation between flags is enforced at startup.
dist/build/package-parallel-worlds.py Moves file-list reads outside the per-buildver loop and adds an UNSHIM_FAST optimization branch in select_matching_members; no logic changes to the non-fast path.
dist/maven-antrun/build-parallel-worlds.xml Threads two new env-var inputs (KEEP_IN_SPARK_SHARED_TXT, UNSHIM_ANALYZER_SCRIPT) to binary-dedupe.sh and makes the remove-dependencies-from-pom target skippable via a new Maven property.
dist/unshimmed-common-from-single-shim.txt Drastically slimmed: all explicit class entries removed (they are now promoted automatically by binary-dedupe), retaining only non-class root resources (META-INF files, private version-info properties, Python worker files).
dist/keep-in-spark-shared.txt New escape-hatch file; currently empty. Patterns added here prevent otherwise-promotable bitwise-identical classes from being moved to the root layout.
docs/dev/shims.md Adds ~80 lines of new documentation covering the one-way shim module boundary, the default-unshim promotion model, and usage examples for the dependency analyzer.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Per-shim Maven builds\nsql-plugin-api + aggregator jars] --> B[parallel-world assembly\nbuild-parallel-worlds.xml / build-unshim-parallel-world.py]
    B --> C[binary-dedupe.sh]
    C --> D{SHA-identical\nacross all shims?}
    D -- Yes --> E{In keep-in-spark-shared.txt?}
    D -- No --> F[spark-shared/\nversion-specific copy]
    E -- Yes --> F
    E -- No --> G[Default promotion\nto root layout]
    C --> H[analyze-parallel-world-deps.py\nwrite ROOT_SAFE_SPARK_SHARED_TXT]
    H -.->|diagnostic only| G
    G --> I[Final dist jar\nroot-layout classes]
    F --> J[Final dist jar\nspark-shared/ classes]
    K[unshimmed-common-from-single-shim.txt\nMETA-INF, .py, version-info] --> L[Explicit root promotion\nnon-class resources]
    L --> I
    M[unshimmed-from-each-spark3xx.txt\nper-shim root resources] --> I
Loading

Reviews (1): Last reviewed commit: "Add default common unshim packaging flow" | Re-trigger Greptile

Comment on lines +253 to +304
queue = collections.deque([(start, [start])])
seen = {start}
while queue:
node, path = queue.popleft()
if node != start and is_version_node(node):
return path
for next_node in sorted(graph[node]):
if next_node not in seen:
seen.add(next_node)
queue.append((next_node, path + [next_node]))
return None


def tarjan_scc(graph):
sys.setrecursionlimit(max(sys.getrecursionlimit(), len(graph) * 2 + 1000))

index = 0
stack = []
on_stack = set()
indexes = {}
lowlinks = {}
components = []

def strongconnect(node):
nonlocal index
indexes[node] = index
lowlinks[node] = index
index += 1
stack.append(node)
on_stack.add(node)

for next_node in graph[node]:
if next_node not in indexes:
strongconnect(next_node)
lowlinks[node] = min(lowlinks[node], lowlinks[next_node])
elif next_node in on_stack:
lowlinks[node] = min(lowlinks[node], indexes[next_node])

if lowlinks[node] == indexes[node]:
component = []
while True:
item = stack.pop()
on_stack.remove(item)
component.append(item)
if item == node:
break
components.append(component)

for node in graph:
if node not in indexes:
strongconnect(node)
return components

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Recursive Tarjan SCC may segfault on large class graphs

tarjan_scc raises sys.setrecursionlimit to len(graph) * 2 + 1000, but CPython's thread stack is typically 8 MB. A dist jar with 10,000+ class nodes (not unusual) could set the limit to 21,000; at ~500–1,000 bytes per Python frame that exceeds 8 MB and causes an OS-level stack overflow (segfault, not a catchable RecursionError). The standard fix is to convert strongconnect to an explicit stack-based iterative DFS.

Comment on lines +175 to +176
sorted_buildvers = sorted(buildvers, reverse=True)
root_buildver = sorted_buildvers[0]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 root_buildver selection is inconsistent with package-parallel-worlds.py

sorted(buildvers, reverse=True)[0] picks the highest buildver to source from_single_shim resources (e.g., rapids4spark-private-version-info.properties, Python scripts). The Maven-driven script (package-parallel-worlds.py) uses buildver_list[0], which reflects the Maven property order and is historically the lowest version. When both paths are used for the same branch the two build modes may produce output assembled from different representative shims, which could silently diverge if any from_single_shim resource ever differs between the lowest and highest shim jars.

Comment on lines +95 to +110
def jar_signature(jar_path):
stat = jar_path.stat()
return "\n".join((
"path=%s" % jar_path,
"size=%s" % stat.st_size,
"mtime_ns=%s" % stat.st_mtime_ns,
"",
))


def dedupe_cache_key(base_dir, scala_binary_version, project_version, buildvers):
parts = []
for buildver in sorted(buildvers, reverse=True):
for artifact in ARTIFACTS:
jar_path = artifact_jar(
base_dir, artifact, scala_binary_version, project_version, buildver)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 mtime-based cache key can produce stale analysis output silently

jar_signature keys the extraction cache on st_size and st_mtime_ns. In buildall, refresh_fast_aggregator_jar uses cp -p, which preserves the source mtime. If the aggregator jar is rebuilt with different bytecode but cp -p copies it over a cache entry that matches the preserved timestamp, the cached extraction is returned without invalidation, and the analyzer sees stale class content. Using a content hash (e.g., SHA-1 of the jar) instead of mtime would make the cache reliably reflect actual content.

Comment on lines +210 to +230
function keep_in_spark_shared() {
set -e
local class_file="$1"
local keep_patterns_txt="${KEEP_IN_SPARK_SHARED_TXT:-}"
[[ -n "$keep_patterns_txt" ]] || return 1
[[ -f "$keep_patterns_txt" ]] || {
echo >&2 "Keep-in-spark-shared list does not exist: $keep_patterns_txt"
exit 255
}

local pattern
while IFS= read -r pattern; do
[[ -n "$pattern" ]] || continue
[[ "$pattern" =~ ^[[:space:]]*# ]] && continue
# shellcheck disable=SC2053
if [[ "$class_file" == $pattern ]]; then
return 0
fi
done < "$keep_patterns_txt"
return 1
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 keep_in_spark_shared re-reads the patterns file for every class

filter_keep_in_spark_shared calls keep_in_spark_shared once per class entry, and each call opens and re-reads $KEEP_IN_SPARK_SHARED_TXT. Currently the file is empty so there is no impact, but once the exclusion list grows this becomes O(N × M) file I/O — one full read per class for each pattern in the list. Caching the patterns in an array before the loop (the same way unshimmed_class_needs_shared_identity uses an inline [[ ]] list) would avoid the repeated reads.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@gerashegalov

Copy link
Copy Markdown
Collaborator Author

build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants