Skip to content

Move Hadoop file I/O helpers to Java module#15030

Open
gerashegalov wants to merge 1 commit into
codex/unshim-stack-02a-api-formatfrom
codex/unshim-stack-02b-fileio
Open

Move Hadoop file I/O helpers to Java module#15030
gerashegalov wants to merge 1 commit into
codex/unshim-stack-02a-api-formatfrom
codex/unshim-stack-02b-fileio

Conversation

@gerashegalov

@gerashegalov gerashegalov commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Related to #14834.

Description

This PR is one reviewable layer in the unshim stack introduced by #15025. It moves Hadoop file I/O helper code into the Java-friendly file I/O module. This includes the S3 input-file plumbing pilot and keeps the file I/O runtime path separate from the broader Scala SQL plugin compilation surface.

Stack context

Testing and validation notes

  • No standalone behavior change is intended in this layer. It is covered by the full-stack packaging/build validation described in Add default common unshim packaging flow #15025 and the existing tests for the affected subsystem.
  • The full split stack was verified to be tree-equivalent to the pre-split stack top.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Covered by the validation notes in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02a-api-format branch from 84ed0c3 to 85f0628 Compare June 13, 2026 12:13
@gerashegalov gerashegalov force-pushed the codex/unshim-stack-02b-fileio branch from b5c391f to 112e88b Compare June 13, 2026 12:13
@gerashegalov gerashegalov marked this pull request as ready for review June 13, 2026 12:48
@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR is one reviewable layer in the unshim stack that moves Hadoop file I/O helpers from sql-plugin into the Java-only sql-plugin-fileio module, decoupling the file I/O runtime path from the broader Scala SQL plugin compilation surface.

  • RapidsInputFiles is promoted from a direct SparkEnv/PerfIOConf look-up to a volatile static bridge backed by the new S3PerfReader interface; the SQL-plugin-side PerfIOS3Reader implements the interface and is wired in by a later stack layer (Add Spark 3.3.0 SQL shim module sources #15043).
  • S3InputFile is rewritten from Scala to Java and moves to sql-plugin-fileio; the S3 routing logic is extracted into the new HadoopInputFileFactory extension point (PerfIOHadoopInputFileFactory), and HadoopFileIO gains a second constructor accepting an optional factory (existing callers using the single-arg constructor remain on the non-PerfIO path until the next layer completes the stack).
  • The deleted sql-plugin files (RapidsInputFiles.java, S3InputFile.scala) are cleanly superseded; no behavioural change is observable in isolation since the bridge registration call is deferred to the next PR in the stack.

Confidence Score: 4/5

Safe to merge as a stack layer; all existing callers continue on the unchanged non-PerfIO path until the next layer wires in the factory.

The migration is structurally clean and introduces no behaviour change for current callers. Two minor concerns: readVectored/readTail in S3InputFile throw IllegalArgumentException (not IOException) when the PerfIO bridge returns false, which can silently bypass standard I/O error handlers; and PerfIOHadoopInputFileFactory is missing a serialVersionUID despite implementing Serializable. Both are low-risk in the current usage but worth addressing before the factory is more widely deployed.

sql-plugin-fileio/src/main/java/com/nvidia/spark/rapids/fileio/hadoop/S3InputFile.java and PerfIOHadoopInputFileFactory.java are the two files worth a second look.

Important Files Changed

Filename Overview
sql-plugin-fileio/src/main/java/com/nvidia/spark/rapids/fileio/RapidsInputFiles.java New bridge class introducing a volatile static S3PerfReader singleton, replacing the previous direct SparkEnv/PerfIOConf look-up with a plugin-registered delegate; structure is clean and thread-safe for the expected single-writer lifecycle.
sql-plugin-fileio/src/main/java/com/nvidia/spark/rapids/fileio/hadoop/HadoopFileIO.java Moved from sql-plugin to sql-plugin-fileio; S3 routing extracted into optional HadoopInputFileFactory field; existing single-arg callers get null factory (non-PerfIO path) until next stack layer.
sql-plugin-fileio/src/main/java/com/nvidia/spark/rapids/fileio/hadoop/S3InputFile.java Port of Scala S3InputFile to Java; throws IllegalArgumentException (not IOException) when the PerfIO bridge returns false in readVectored/readTail, which is preserved behaviour from the Scala require() but surprises callers that only catch IOException.
sql-plugin-fileio/src/main/java/com/nvidia/spark/rapids/fileio/hadoop/PerfIOHadoopInputFileFactory.java Serializable singleton factory with readResolve(); missing serialVersionUID declaration which can cause InvalidClassException if the class is ever deserialized across a version boundary.
sql-plugin/src/main/java/com/nvidia/spark/rapids/fileio/hadoop/PerfIOS3Reader.java New SQL-plugin bridge to Scala PerfIO state; uses deprecated scala.collection.JavaConverters for Scala 2.13 compatibility, and readToHostMemory does not declare throws IOException even though calling methods do.
sql-plugin-fileio/src/main/java/com/nvidia/spark/rapids/fileio/hadoop/HadoopInputFileFactory.java Clean @FunctionalInterface extension point; extends Serializable is intentional to allow the factory to travel with a serialized HadoopFileIO.

Sequence Diagram

sequenceDiagram
    participant Caller as Caller (GpuParquetScan etc.)
    participant HFileIO as HadoopFileIO (sql-plugin-fileio)
    participant PerfFact as PerfIOHadoopInputFileFactory (sql-plugin-fileio)
    participant RIF as RapidsInputFiles (sql-plugin-fileio)
    participant Bridge as PerfIOS3Reader (sql-plugin)
    participant PerfIO as PerfIO$ (Scala, sql-plugin)
    participant HIF as HadoopInputFile (sql-plugin-fileio)
    participant S3IF as S3InputFile (sql-plugin-fileio)

    Caller->>HFileIO: newInputFile(path)
    alt "inputFileFactory == null (existing callers)"
        HFileIO->>HIF: create(path, conf)
        HIF-->>Caller: HadoopInputFile
    else "inputFileFactory != null (next stack layer)"
        HFileIO->>PerfFact: create(path, conf)
        PerfFact->>RIF: isS3PerfEnabled()
        RIF->>Bridge: isEnabled()
        Bridge->>Bridge: SparkEnv.get().conf()
        Bridge-->>RIF: true/false
        alt S3 scheme AND PerfIO enabled
            PerfFact->>S3IF: create(path, conf)
            S3IF-->>Caller: S3InputFile
            Caller->>S3IF: readVectored(output, ranges)
            S3IF->>RIF: readS3Vectored(...)
            RIF->>Bridge: readVectored(...)
            Bridge->>PerfIO: readToHostMemory(...)
            PerfIO-->>Bridge: Option[result]
            Bridge-->>S3IF: true/false
        else fallback
            PerfFact->>HIF: create(path, conf)
            HIF-->>Caller: HadoopInputFile
        end
    end
Loading

Reviews (1): Last reviewed commit: "Move Hadoop file I/O helpers to Java mod..." | Re-trigger Greptile

Comment on lines +74 to +77
throws IOException {
if (!RapidsInputFiles.readS3Vectored(hadoopConf, fileUri, output, copyRanges)) {
throw new IllegalArgumentException("expected to use PerfIO to read");
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 IllegalArgumentException escapes from a method declared throws IOException

Both readVectored and readTail throw IllegalArgumentException when the PerfIO bridge returns false. Because IllegalArgumentException is an unchecked exception, it will propagate straight through callers that only have a catch (IOException e) guard — the error will be invisible at every layer that handles I/O failures. The Scala original used require() with the same semantics, but in Java the disparity between the declared checked signature and the actual unchecked throw is more surprising. Wrapping in an IOException (throw new IOException("expected to use PerfIO to read")) would align intent with the declared signature and ensure the failure is caught by any standard I/O error handler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants