Adapt Kudo and file scan callers to Java helpers #15049
Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
Greptile Summary
This layer of the unshim stack migrates
Confidence Score: 4/5
The mechanical case-class-to-class and Logging migrations are correct, but ThreadPoolConfBuilder carries its SLF4J logger as a non-transient field while the class is Serializable and embedded in reader factories that Spark ships to executors.
ThreadPoolConfBuilder.log is a plain private val (no @transient) inside a class that explicitly extends Serializable and is stored as a non-transient field in GpuOrcMultiFilePartitionReaderFactory and its Parquet counterparts. When Spark serializes those factories to dispatch to executors, the logger must be serializable too. Logback happens to be serializable, but Log4j2's SLF4J bridge (the default in many Spark 3.x deployments) is not, so this can surface as a NotSerializableException at task-dispatch time depending on the backend. The rest of the PR (Java FileUtils/SpillableKudoTable, Buffer casts, RapidsLocalLog, and the case-class migrations) looks mechanically correct.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala: ThreadPoolConfBuilder logger field needs @transient.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[File Scan / Kudo Caller] --> B{Logging strategy}
B -->|Object or standalone| C[private val log via SLF4J directly]
B -->|Serializable reader factory| D[RapidsLocalLog trait]
D --> F[GpuOrcMultiFilePartitionReaderFactory\nGpuCSVPartitionReaderFactory\nOrcTableReader]
C --> E[ThreadPoolConfBuilder\nnon-transient log field]
E -->|stored as field in| F
G[FileUtils.java] --> H[TempFile wrapper]
H --> I[DumpUtils callers\nKudoTableOperator]
J[SpillableKudoTable.java] --> K[from / makeKudoTable]
K --> L[GpuColumnarBatchSerializer\nDumpUtilsSuite]
    }
  }

  private def logWarning(msg: => String): Unit = {
    log.warn(msg)
  }
Non-transient logger in Serializable class
ThreadPoolConfBuilder extends Serializable and is stored as a non-transient field in GpuOrcMultiFilePartitionReaderFactory and its Parquet equivalents, all of which are Java-serialized when Spark ships them to executors. The private val log field is not annotated @transient, so Java serialization will attempt to serialize the org.slf4j.Logger instance. With Log4j2 as the SLF4J backend (the default since Spark 3.3), org.apache.logging.slf4j.Log4jLogger does not implement java.io.Serializable, which can cause a NotSerializableException at task dispatch time. The RapidsLocalLog trait in this same PR correctly guards the logger with @transient private lazy val; the same pattern should be used here.
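The difference between the two field shapes can be sketched in isolation. The example below is a minimal illustration, not the PR's code: FakeLogger, EagerLogBuilder, TransientLogBuilder, and SerializationSketch are hypothetical names, and FakeLogger is a stand-in that is deliberately not Serializable. Whether a particular SLF4J binding is serializable is not checked here.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, NotSerializableException,
  ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a logger implementation that is NOT java.io.Serializable.
final class FakeLogger(val name: String) {
  def warn(msg: String): Unit = System.err.println(s"WARN $name: $msg")
}

// Flagged shape: a Serializable class that holds the logger as a plain field,
// so Java serialization tries to write the logger along with the object.
class EagerLogBuilder(val maxThreads: Int) extends Serializable {
  private val log = new FakeLogger(getClass.getName)
  def logWarning(msg: => String): Unit = log.warn(msg)
}

// Suggested shape: the logger is skipped during serialization and
// re-created lazily on first use, including after deserialization.
class TransientLogBuilder(val maxThreads: Int) extends Serializable {
  @transient private lazy val log = new FakeLogger(getClass.getName)
  def logWarning(msg: => String): Unit = log.warn(msg)
}

object SerializationSketch {
  // Serialize to a byte array and read the object back, as a rough proxy
  // for shipping a factory from the driver to an executor.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // True when serialization is rejected because of a non-serializable field.
  def failsToSerialize(value: AnyRef): Boolean =
    try { roundTrip(value); false }
    catch { case _: NotSerializableException => true }

  def main(args: Array[String]): Unit = {
    println(failsToSerialize(new EagerLogBuilder(4)))
    val copy = roundTrip(new TransientLogBuilder(4))
    copy.logWarning("logger re-created after deserialization")
    println(copy.maxThreads)
  }
}
```

In this sketch the eager variant is rejected by ObjectOutputStream, while the transient variant round-trips and can still log afterwards because the lazy val is initialized again on the receiving side.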
Related to #14834.
Description
This PR is one reviewable layer in the unshim stack introduced by #15025. It updates Kudo dump and file-scan callers to use the moved Java-friendly helper modules. The review surface is intentionally limited to Kudo/file scan integration points.
Stack context
Testing and validation notes
Checklists
Documentation
Testing
(Covered by the validation notes in the PR description.)
Performance