Skip to content

Remove old Spark 3.3.0 shim sources#15035

Open
gerashegalov wants to merge 1 commit into
codex/unshim-stack-02r-common-shim-helpersfrom
codex/unshim-stack-02s-shim-cleanup-330
Open

Remove old Spark 3.3.0 shim sources#15035
gerashegalov wants to merge 1 commit into
codex/unshim-stack-02r-common-shim-helpersfrom
codex/unshim-stack-02s-shim-cleanup-330

Conversation

@gerashegalov

@gerashegalov gerashegalov commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Related to #14834.

Description

This PR is one reviewable layer in the unshim stack introduced by #15025. It removes old Spark 3.3.0 shim sources that are now provided by shared helpers or the new helper modules. This is scoped to the OSS Spark 3.3.0 cleanup.

Stack context

Testing and validation notes

  • No standalone behavior change is intended in this layer. It is covered by the full-stack packaging/build validation described in Add default common unshim packaging flow #15025 and the existing tests for the affected subsystem.
  • The full split stack was verified to be tree-equivalent to the pre-split stack top.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Covered by the validation notes in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch from 0861ac7 to 3aedcbc Compare June 10, 2026 20:49
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch 2 times, most recently from 21ca89a to d54de00 Compare June 10, 2026 21:13
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch from 3aedcbc to 543d7d8 Compare June 10, 2026 21:13
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch from d54de00 to fad5d01 Compare June 10, 2026 21:32
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch 2 times, most recently from 8f6db57 to c80d61c Compare June 10, 2026 21:36
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch from fad5d01 to 499dac2 Compare June 10, 2026 21:36
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch from c80d61c to ddfb246 Compare June 10, 2026 22:20
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch 2 times, most recently from 04da28c to 5c9d602 Compare June 10, 2026 22:37
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch 2 times, most recently from 19375c6 to c271777 Compare June 10, 2026 22:41
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch 3 times, most recently from bb0cf14 to b2a7a9e Compare June 10, 2026 22:59
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch from 91a33db to ce1c95c Compare June 10, 2026 23:12
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch from b2a7a9e to 8bb1ce8 Compare June 10, 2026 23:12
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch from ce1c95c to b91960c Compare June 10, 2026 23:15
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch 2 times, most recently from e23faf4 to 77b338e Compare June 10, 2026 23:29
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch from b91960c to 6a596b3 Compare June 10, 2026 23:33
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch 5 times, most recently from 0ac124a to 9965c09 Compare June 11, 2026 00:37
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch from 6c50ba6 to a99339a Compare June 11, 2026 00:37
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch 7 times, most recently from 36f44d8 to 0e24722 Compare June 13, 2026 12:13
Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02s-shim-cleanup-330 branch from 0e24722 to 4ec5b87 Compare June 13, 2026 12:20
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02r-common-shim-helpers branch from a99339a to 0471b91 Compare June 13, 2026 12:20
@gerashegalov gerashegalov marked this pull request as ready for review June 13, 2026 12:49
@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR removes Spark 3.3.0-specific shim sources that are now covered by shared helper modules introduced earlier in the unshim stack. It also makes several backward-compatible adjustments to the remaining spark330 files to enable binary-class deduplication with newer shims.

  • Deletions (11 files): DateTimeUtilsShims, SequenceSizeTooLongErrorBuilder, SparkShimServiceProvider, ConvUtils, ShuffleManagerShims, FileCommitProtocolShims, OriginContextShim, SparkSessionUtils, SparkUpgradeExceptionShims, TrampolineConnectShims, and ShuffleClientShims are removed because they are now provided by shared helper modules in the stack.
  • Binary-dedupe alignment: Remaining spark330 files adopt intentional line-number padding, string-literal enum comparisons in LegacyBehaviorPolicyShim, and inline SLF4J loggers (replacing with Logging mixins) so compiled bytecode can be recognized as identical to the common-module class.
  • Reflection-based ORC protobuf dispatch (OrcProtoWriterShim): Replaced direct CodedOutputStream use with a lazy, reflection-driven protoApis registry that discovers both org.apache.orc.protobuf and com.google.protobuf at runtime, allowing the same compiled class to work across both protobuf variants.

Confidence Score: 4/5

Structured cleanup of spark330-specific shim sources; no new end-user logic is introduced, and the reflection-based ORC dispatch and Logging-to-SLF4J conversions are technically correct.

The reflection-based OrcProtoWriterShim and the GpuGroupedPythonRunnerFactory API surface changes are the two areas that deserve a second look. The reflection dispatch is sound — protoApis is lazily populated once and the per-instance cache is always flushed before reassignment — but the single-slot cache design is worth documenting. The removal of the case class companion apply and the argNames default value are intentional within the validated stack but are silent source-compatibility breaks for any downstream consumer.

OrcProtoWriterShim.scala (reflection-based protobuf dispatch) and GpuGroupedPythonRunnerFactory.scala (case class to class, default arg removal) warrant the closest review; deleted files and Logging-to-SLF4J conversions are straightforward.

Important Files Changed

Filename Overview
sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/OrcProtoWriterShim.scala Replaced direct protobuf CodedOutputStream usage with a reflection-based approach to support both org.apache.orc.protobuf and com.google.protobuf variants; adds lazy protoApis discovery and per-instance API caching in proxiedFor.
sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/NullIntolerantShim.scala Extended NullIntolerantShim with nullIntolerant: Boolean and added a new GpuLiteralShim abstract class with a jsonFields implementation covering Date/Timestamp serialization.
sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shuffle/RapidsShuffleIterator.scala Dropped Spark Logging mixin; replaced with explicit SLF4J logger and private log helpers for binary-dedupe; also fixes GpuSemaphore.acquireIfNecessary explicit eta-expansion.
sql-plugin/src/main/spark330/scala/org/apache/spark/sql/rapids/RapidsCachingReader.scala Dropped Spark Logging mixin; replaced with inline SLF4J logger with logInfo/logDebug helpers only, consistent with actual call sites in this class.
sql-plugin/src/main/spark330/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupedPythonRunnerFactory.scala Changed from case class to class extends Serializable and removed the default argNames = None; removes auto-generated companion apply so callers must use new or be updated in the stack.
sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/SparkShims.scala Updated getDataWriteCmds to use new GpuOverrides.dataWriteCmdFromShim + CreateDataSourceTableAsSelectRules.dataWriteCmd helper pattern instead of the inline dataWriteCmd call.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/shims/Spark320PlusShims.scala Removed unused with Logging mixin and its import; no method bodies in this trait directly used the Logging helpers.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/shims/gpuWindows.scala Changed ParsedBoundary apply to new ParsedBoundary constructor call since the case class companion no longer exists in the newer shim.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[writeAndFlush called] --> B{apiFor obj}
    B -- None --> C[throw IllegalArgumentException]
    B -- Some api --> D{proxiedApi == api?}
    D -- Yes --> E[reuse cached proxied]
    D -- No --> F[invoke api.newInstance with orcOutStream]
    F --> G[update proxiedApi + proxied]
    G --> E
    E --> H[api.writeTo.invoke obj proxied]
    H --> I[api.flush.invoke proxied]
    I --> J[orcOutStream.flush]

    subgraph protoApis [lazy protoApis init]
        K[try org.apache.orc.protobuf] --> L{found?}
        L -- Yes --> M[add ProtoApi]
        L -- No --> N[skip]
        O[try com.google.protobuf] --> P{found?}
        P -- Yes --> Q[add ProtoApi]
        P -- No --> R[skip]
    end

    B -.->|first call triggers| protoApis
Loading

Comments Outside Diff (1)

  1. sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/OrcProtoWriterShim.scala, line 230-239 (link)

    P2 Per-instance API cache resets on type switch

    proxiedFor caches only the most-recently-used ProtoApi per OrcProtoWriterShim instance. If writeAndFlush is called alternately with messages from two different protobuf variants (one org.apache.orc.protobuf, one com.google.protobuf), a new CodedOutputStream wrapping orcOutStream is allocated on every call. The old stream is already flushed before reassignment, so no data is lost, but in a mixed-variant scenario the cache never hits. In practice ORC writers use a single protobuf variant per JVM, so this is unlikely to matter, but the single-slot, last-used nature of the cache is worth documenting.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "Remove old Spark 3.3.0 shim sources" | Re-trigger Greptile

Comment on lines 47 to 56

case class GpuGroupedPythonRunnerFactory(
class GpuGroupedPythonRunnerFactory(
conf: org.apache.spark.sql.internal.SQLConf,
chainedFunc: Seq[(ChainedPythonFunctions, Long)],
argOffsets: Array[Array[Int]],
dedupAttrs: StructType,
pythonOutputSchema: StructType,
evalType: Int,
argNames: Option[Array[Array[Option[String]]]] = None) {
argNames: Option[Array[Array[Option[String]]]]) extends Serializable {
val sessionLocalTimeZone = conf.sessionLocalTimeZone

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Removal of case class companion apply and default argNames

Converting from case class to class removes the auto-generated companion apply factory method. Any call site using GpuGroupedPythonRunnerFactory(args...) without new will no longer compile. Removing = None from argNames also means every call site must now supply the argument explicitly. Both are intentional within the validated stack, but represent silent source-compatibility breaks for any downstream or out-of-tree consumer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants