Skip to content

Adapt Kudo and file scan callers to Java helpers#15049

Open
gerashegalov wants to merge 3 commits into
codex/unshim-stack-02l-rule-metadata-callersfrom
codex/unshim-stack-02m-kudo-file-scans
Open

Adapt Kudo and file scan callers to Java helpers#15049
gerashegalov wants to merge 3 commits into
codex/unshim-stack-02l-rule-metadata-callersfrom
codex/unshim-stack-02m-kudo-file-scans

Conversation

@gerashegalov

@gerashegalov gerashegalov commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Related to #14834.

Description

This PR is one reviewable layer in the unshim stack introduced by #15025. It updates Kudo dump and file-scan callers to use the moved Java-friendly helper modules. The review surface is intentionally limited to Kudo/file scan integration points.

Stack context

Testing and validation notes

  • No standalone behavior change is intended in this layer. It is covered by the full-stack packaging/build validation described in Add default common unshim packaging flow #15025 and the existing tests for the affected subsystem.
  • The full split stack was verified to be tree-equivalent to the pre-split stack top.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Covered by the validation notes in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02l-rule-metadata-callers branch from ea9be01 to d6b101f Compare June 10, 2026 20:49
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02m-kudo-file-scans branch 2 times, most recently from df44354 to aa5c97a Compare June 10, 2026 21:13
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02l-rule-metadata-callers branch 2 times, most recently from af3b30c to e1865fd Compare June 10, 2026 21:32
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02m-kudo-file-scans branch 2 times, most recently from 8eeb39d to 8717459 Compare June 10, 2026 21:36
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02l-rule-metadata-callers branch from 016543c to c53fa85 Compare June 10, 2026 22:20
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02m-kudo-file-scans branch from 8717459 to 1ffdf02 Compare June 10, 2026 22:20
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02l-rule-metadata-callers branch from c53fa85 to d0a0fb5 Compare June 10, 2026 22:37
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02m-kudo-file-scans branch from 1ffdf02 to 9fb1bad Compare June 10, 2026 22:37
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02l-rule-metadata-callers branch from d0a0fb5 to c983119 Compare June 10, 2026 22:41
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02m-kudo-file-scans branch 5 times, most recently from a206a28 to eb2e327 Compare June 10, 2026 23:33
Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02l-rule-metadata-callers branch from 11daef2 to 8a3927f Compare June 13, 2026 12:13
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02m-kudo-file-scans branch from 8cd5d42 to df09b9c Compare June 13, 2026 12:13
@gerashegalov gerashegalov marked this pull request as ready for review June 13, 2026 12:49
@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This layer of the unshim stack migrates FileUtils and SpillableKudoTable from Scala to Java and converts their Scala callers (Kudo dump, file-scan factories, ORC/Parquet/Avro/CSV readers) to use the new Java-friendly APIs. As a byproduct, many internal case class types are changed to plain class (adding Serializable and explicit field accessors), and org.apache.spark.internal.Logging is replaced throughout by the new RapidsLocalLog trait or direct SLF4J calls. ByteBuffer position calls are cast through Buffer for Java 8/11+ cross-version compatibility.

  • Scala FileUtils and SpillableKudoTable deleted; Java equivalents added; all call sites updated to use .getOutputStream, .getPath, .from(), and .makeKudoTable().
  • RapidsLocalLog trait introduced in GpuExec.scala with @transient private lazy val logger, used by reader factories and table producers that extend Serializable.
  • ThreadPoolConfBuilder.log added without @transient in a Serializable class whose instance is embedded as a non-transient field in reader factories dispatched to executors.

Confidence Score: 4/5

The mechanical case-class-to-class and Logging migrations are correct, but ThreadPoolConfBuilder carries its SLF4J logger as a non-transient field while the class is Serializable and embedded in reader factories that Spark ships to executors.

ThreadPoolConfBuilder.log is a plain private val (no @transient) inside a class that explicitly extends Serializable and is stored as a non-transient field in GpuOrcMultiFilePartitionReaderFactory and its Parquet counterparts. When Spark serializes those factories to dispatch to executors, the logger must be serializable too. Logback happens to be serializable, but Log4j2's SLF4J bridge (the default in many Spark 3.x deployments) is not, so this can surface as a NotSerializableException at task-dispatch time depending on the backend. The rest of the PR — Java FileUtils/SpillableKudoTable, Buffer casts, RapidsLocalLog, and the case-class migrations — looks mechanically correct.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala — ThreadPoolConfBuilder logger field needs @transient.

Important Files Changed

Filename Overview
sql-plugin/src/main/java/com/nvidia/spark/rapids/SpillableKudoTable.java Direct Java translation of the deleted Scala SpillableKudoTable; constructor semantics, close(), and makeKudoTable() are equivalent to the removed Scala version.
sql-plugin-fileio/src/main/java/com/nvidia/spark/rapids/FileUtils.java Java replacement for the deleted Scala FileUtils; createTempFile loop logic is identical to the original, now returning a TempFile wrapper with typed getters.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuExec.scala Introduces RapidsLocalLog trait as a Spark-Logging-free alternative; correctly uses @transient private lazy val for the logger to survive serialization.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala Migrates case class/object logging to RapidsLocalLog or direct SLF4J; ThreadPoolConfBuilder.log is non-transient in a Serializable class, risking NotSerializableException with Log4j2.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala Large migration from case class to class throughout ORC scan stack; adds Buffer casts for Java 8/11+ ByteBuffer cross-version compat; PerfIOHadoopInputFileFactory.INSTANCE added to HadoopFileIO.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/GpuParquetScan.scala Same case-class-to-class migration as ORC; Buffer casts for cross-JDK ByteBuffer compat; PerfIOHadoopInputFileFactory.INSTANCE passed to both fileIO lazy vals.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffleCoalesceExec.scala Adapts KudoTableOperator to Java FileUtils.TempFile API; case class to class migrations for CloseableTableSeqWithTargetSize and RowCountOnlyMergeResult; Serializable added where needed.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/DumpUtils.scala Callers updated from tuple-destructuring FileUtils.createTempFile to TempFile accessor methods; stream still correctly wrapped in withResource.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/AvroDataFileReader.scala Moves import from TrampolineConnectShims to TrampolineUtil.createSchemaParser(); updates to Java setter API for MutableBlockInfo.
sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/GpuHiveTableScanExec.scala Introduces shimSparkSession helper to avoid ambiguous SparkSession type reference; all sparkSession usages replaced consistently.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[File Scan / Kudo Caller] --> B{Logging strategy}
    B -->|Object or standalone| C[private val log via SLF4J directly]
    B -->|Serializable reader factory| D[RapidsLocalLog trait]
    D --> F[GpuOrcMultiFilePartitionReaderFactory\nGpuCSVPartitionReaderFactory\nOrcTableReader]
    C --> E[ThreadPoolConfBuilder\nnon-transient log field]
    E -->|stored as field in| F
    G[FileUtils.java] --> H[TempFile wrapper]
    H --> I[DumpUtils callers\nKudoTableOperator]
    J[SpillableKudoTable.java] --> K[from / makeKudoTable]
    K --> L[GpuColumnarBatchSerializer\nDumpUtilsSuite]
Loading

Reviews (1): Last reviewed commit: "Fix Delta coalescing reader helper const..." | Re-trigger Greptile

Comment on lines +431 to +437
}
}

private def logWarning(msg: => String): Unit = {
log.warn(msg)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Non-transient logger in Serializable class

ThreadPoolConfBuilder extends Serializable and is stored as a non-transient field in GpuOrcMultiFilePartitionReaderFactory and the Parquet equivalents — both of which are Java-serialized when Spark ships them to executors. The private val log field is not annotated @transient, so Java serialization will attempt to serialize the org.slf4j.Logger instance. With Log4j2 as the SLF4J backend (the default in Spark 3.x), org.apache.logging.slf4j.Log4jLogger does not implement java.io.Serializable, causing a NotSerializableException at task dispatch time. The RapidsLocalLog trait in this same PR correctly guards the logger with @transient private lazy val — the same pattern should be used here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants