Add default common unshim packaging flow#15025
Conversation
6d223b7 to
a0d7661
Compare
a0d7661 to
d303bb3
Compare
Greptile SummaryThis PR inverts the parallel-world packaging policy: bitwise-identical common classes are now promoted to the root jar layout by default, rather than requiring explicit allowlist entries. The
Confidence Score: 4/5Safe to merge; all changes are build/packaging tooling with no runtime bytecode impact. The packaging logic inversion is well-reasoned and the escape-hatch file prevents regressions. The recursive Tarjan SCC in the analyzer can segfault when processing a large dist jar with many class nodes, which would break the analysis workflow for developers. The fast-path script picks a different representative shim (highest vs. lowest) than the Maven build path for from_single_shim resources, which is a latent inconsistency. The mtime-based jar cache can serve stale content silently after a cp -p update. These are all isolated to developer tooling; production artifact generation goes through the Maven path which is unchanged in behavior. dist/scripts/analyze-parallel-world-deps.py (recursive SCC), dist/scripts/build-unshim-parallel-world.py (root_buildver selection and mtime cache), dist/scripts/binary-dedupe.sh (keep_in_spark_shared per-call file reads) Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Per-shim Maven builds\nsql-plugin-api + aggregator jars] --> B[parallel-world assembly\nbuild-parallel-worlds.xml / build-unshim-parallel-world.py]
B --> C[binary-dedupe.sh]
C --> D{SHA-identical\nacross all shims?}
D -- Yes --> E{In keep-in-spark-shared.txt?}
D -- No --> F[spark-shared/\nversion-specific copy]
E -- Yes --> F
E -- No --> G[Default promotion\nto root layout]
C --> H[analyze-parallel-world-deps.py\nwrite ROOT_SAFE_SPARK_SHARED_TXT]
H -.->|diagnostic only| G
G --> I[Final dist jar\nroot-layout classes]
F --> J[Final dist jar\nspark-shared/ classes]
K[unshimmed-common-from-single-shim.txt\nMETA-INF, .py, version-info] --> L[Explicit root promotion\nnon-class resources]
L --> I
M[unshimmed-from-each-spark3xx.txt\nper-shim root resources] --> I
Reviews (1): Last reviewed commit: "Add default common unshim packaging flow" | Re-trigger Greptile |
| queue = collections.deque([(start, [start])]) | ||
| seen = {start} | ||
| while queue: | ||
| node, path = queue.popleft() | ||
| if node != start and is_version_node(node): | ||
| return path | ||
| for next_node in sorted(graph[node]): | ||
| if next_node not in seen: | ||
| seen.add(next_node) | ||
| queue.append((next_node, path + [next_node])) | ||
| return None | ||
|
|
||
|
|
||
| def tarjan_scc(graph): | ||
| sys.setrecursionlimit(max(sys.getrecursionlimit(), len(graph) * 2 + 1000)) | ||
|
|
||
| index = 0 | ||
| stack = [] | ||
| on_stack = set() | ||
| indexes = {} | ||
| lowlinks = {} | ||
| components = [] | ||
|
|
||
| def strongconnect(node): | ||
| nonlocal index | ||
| indexes[node] = index | ||
| lowlinks[node] = index | ||
| index += 1 | ||
| stack.append(node) | ||
| on_stack.add(node) | ||
|
|
||
| for next_node in graph[node]: | ||
| if next_node not in indexes: | ||
| strongconnect(next_node) | ||
| lowlinks[node] = min(lowlinks[node], lowlinks[next_node]) | ||
| elif next_node in on_stack: | ||
| lowlinks[node] = min(lowlinks[node], indexes[next_node]) | ||
|
|
||
| if lowlinks[node] == indexes[node]: | ||
| component = [] | ||
| while True: | ||
| item = stack.pop() | ||
| on_stack.remove(item) | ||
| component.append(item) | ||
| if item == node: | ||
| break | ||
| components.append(component) | ||
|
|
||
| for node in graph: | ||
| if node not in indexes: | ||
| strongconnect(node) | ||
| return components |
There was a problem hiding this comment.
Recursive Tarjan SCC may segfault on large class graphs
tarjan_scc raises sys.setrecursionlimit to len(graph) * 2 + 1000, but CPython's thread stack is typically 8 MB. A dist jar with 10,000+ class nodes (not unusual) could set the limit to 21,000; at ~500–1,000 bytes per Python frame that exceeds 8 MB and causes an OS-level stack overflow (segfault, not a catchable RecursionError). The standard fix is to convert strongconnect to an explicit stack-based iterative DFS.
| sorted_buildvers = sorted(buildvers, reverse=True) | ||
| root_buildver = sorted_buildvers[0] |
There was a problem hiding this comment.
root_buildver selection is inconsistent with package-parallel-worlds.py
sorted(buildvers, reverse=True)[0] picks the highest buildver to source from_single_shim resources (e.g., rapids4spark-private-version-info.properties, Python scripts). The Maven-driven script (package-parallel-worlds.py) uses buildver_list[0], which reflects the Maven property order and is historically the lowest version. When both paths are used for the same branch the two build modes may produce output assembled from different representative shims, which could silently diverge if any from_single_shim resource ever differs between the lowest and highest shim jars.
| def jar_signature(jar_path): | ||
| stat = jar_path.stat() | ||
| return "\n".join(( | ||
| "path=%s" % jar_path, | ||
| "size=%s" % stat.st_size, | ||
| "mtime_ns=%s" % stat.st_mtime_ns, | ||
| "", | ||
| )) | ||
|
|
||
|
|
||
| def dedupe_cache_key(base_dir, scala_binary_version, project_version, buildvers): | ||
| parts = [] | ||
| for buildver in sorted(buildvers, reverse=True): | ||
| for artifact in ARTIFACTS: | ||
| jar_path = artifact_jar( | ||
| base_dir, artifact, scala_binary_version, project_version, buildver) |
There was a problem hiding this comment.
mtime-based cache key can produce stale analysis output silently
jar_signature keys the extraction cache on st_size and st_mtime_ns. In buildall, refresh_fast_aggregator_jar uses cp -p, which preserves the source mtime. If the aggregator jar is rebuilt with different bytecode but cp -p copies it over a cache entry that matches the preserved timestamp, the cached extraction is returned without invalidation, and the analyzer sees stale class content. Using a content hash (e.g., SHA-1 of the jar) instead of mtime would make the cache reliably reflect actual content.
| function keep_in_spark_shared() { | ||
| set -e | ||
| local class_file="$1" | ||
| local keep_patterns_txt="${KEEP_IN_SPARK_SHARED_TXT:-}" | ||
| [[ -n "$keep_patterns_txt" ]] || return 1 | ||
| [[ -f "$keep_patterns_txt" ]] || { | ||
| echo >&2 "Keep-in-spark-shared list does not exist: $keep_patterns_txt" | ||
| exit 255 | ||
| } | ||
|
|
||
| local pattern | ||
| while IFS= read -r pattern; do | ||
| [[ -n "$pattern" ]] || continue | ||
| [[ "$pattern" =~ ^[[:space:]]*# ]] && continue | ||
| # shellcheck disable=SC2053 | ||
| if [[ "$class_file" == $pattern ]]; then | ||
| return 0 | ||
| fi | ||
| done < "$keep_patterns_txt" | ||
| return 1 | ||
| } |
There was a problem hiding this comment.
keep_in_spark_shared re-reads the patterns file for every class
filter_keep_in_spark_shared calls keep_in_spark_shared once per class entry, and each call opens and re-reads $KEEP_IN_SPARK_SHARED_TXT. Currently the file is empty so there is no impact, but once the exclusion list grows this becomes O(N × M) file I/O — one full read per class for each pattern in the list. Caching the patterns in an array before the loop (the same way unshimmed_class_needs_shared_identity uses an inline [[ ]] list) would avoid the repeated reads.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
|
build |
Related to #14834.
Description
This is the bottom layer and introduction for the unshim stack. The overall goal of the stack is to stop treating byte-for-byte identical common classes as shimmed artifacts just because they are emitted through the parallel-world build. Instead, common classes should land in the normal root jar layout by default, and only classes that truly need
spark-sharedpackaging should remain there explicitly.This PR adds that packaging policy and the tooling needed to make it auditable. The follow-up PRs then move code in themed batches so each review can focus on one class family or one caller migration at a time.
Why invert the model
Before this stack, the packaging flow was driven by explicit lists of classes that had been proven safe to unshim. That made every new common class suspicious by default: unless it was discovered, tested, and added to the unshim list, it stayed in
spark-sharedeven when all shim builds produced identical bytecode.After this PR, bitwise-identical common classes are promoted to the root layout by default. The exceptional case is now explicit:
dist/keep-in-spark-shared.txtrecords patterns for classes that are identical but still must remain underspark-sharedfor compatibility or packaging reasons. New common code should stay unshimmed unless there is a concrete reason not to.What changes in this PR
dist/build/package-parallel-worlds.pyand the parallel-world antrun packaging.dist/keep-in-spark-shared.txtas the explicit escape hatch for classes that must remain inspark-shared.dist/scripts/build-unshim-parallel-world.pyto build/package a single-shim view cheaply while iterating on unshim candidates.dist/scripts/analyze-parallel-world-deps.pyto inspect class-file dependencies and report remaining static paths from root/common classes to version-specific shim bytecode.dist/scripts/binary-dedupe.shdiagnostics so the packaging result can be explained from class-file evidence instead of by inspection of jar contents alone.build/buildallfor repeated unshim iteration, including cheaper paths and-Ddist.jar.compress=falseon the fast path.docs/dev/shims.md,docs/dev/shimplify.md, anddist/README.mdto document the new model.Reviewer guidance
The important policy decision is in this PR: the default changes from "stay shimmed until listed" to "unshim when all emitted common bytecode is identical, unless excluded." Please review the packaging rules, the exclusion file semantics, and the analyzer/dedupe tooling with that in mind.
The large class movement is intentionally not in this PR. The rest of the stack applies the new model in smaller layers so reviewers can inspect each group without reading the entire migration at once.
Stack map
Testing and validation notes
spark-shared/*.classentries for both Scala 2.12 and Scala 2.13 OSS shim sets.Checklists
Documentation
Testing
(Covered by packaging/noSnapshots validation and the existing parallel-world packaging checks described above.)
Performance