[AutoSparkUT] Fix ORC coalescing ignoreMissingFiles #15103

Open
wjxiz1992 wants to merge 4 commits into NVIDIA:main from wjxiz1992:fix/15100-orc-ignore-missing

Conversation

@wjxiz1992 wjxiz1992 commented Jun 17, 2026

Collaborator

Closes #15100.

This fixes the ORC coalescing reader path so it honors spark.files.ignoreMissingFiles / spark.sql.files.ignoreMissingFiles during the metadata filtering step.

Root cause: the coalescing ORC reader reads ORC tail metadata in filterStripes before the later partition reader layer can apply the existing missing-file handling. If a planned ORC file disappears after planning, filterStripes throws FileNotFoundException and the GPU path fails even when Spark is configured to ignore missing files.

Changes:

  • Catch FileNotFoundException during ORC coalescing stripe filtering only when ignoreMissingFiles is enabled.
  • Keep all other ORC/schema/corruption errors unchanged.
  • Add a regression test that plans an ORC scan, deletes a subset of planned files, forces the coalescing reader, and compares CPU/GPU results.
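The guard described in the list above can be sketched in isolation. This is a minimal illustration, not the plugin's code: `readTail`, `collectTails`, and the `existing` set are hypothetical stand-ins for the ORC tail read and the file system, and `println` stands in for the warning log.

```scala
import java.io.FileNotFoundException
import scala.collection.mutable.ArrayBuffer

object IgnoreMissingSketch {
  // Hypothetical stand-in for reading a file's ORC tail:
  // throws FileNotFoundException when the path is not in `existing`.
  def readTail(path: String, existing: Set[String]): String = {
    if (!existing.contains(path)) throw new FileNotFoundException(path)
    s"tail($path)"
  }

  // Collects metadata for each planned file. A missing file is skipped only
  // when ignoreMissingFiles is true; every other exception still propagates.
  def collectTails(
      planned: Seq[String],
      existing: Set[String],
      ignoreMissingFiles: Boolean): Seq[String] = {
    val tails = ArrayBuffer.empty[String]
    planned.foreach { path =>
      try {
        tails += readTail(path, existing)
      } catch {
        case e: FileNotFoundException if ignoreMissingFiles =>
          // Assumption: println replaces the warning log to keep the sketch dependency-free.
          println(s"Skipping missing file: ${e.getMessage}")
      }
    }
    tails.toSeq
  }
}
```

Because the pattern guard `if ignoreMissingFiles` is part of the `case`, the exception is simply not matched when the flag is false, so it reaches the caller unchanged.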

Validation:

mvn package -pl tests -am -Dbuildver=330 \
  -Dmaven.repo.local=./.mvn-repo \
  -DwildcardSuites=com.nvidia.spark.rapids.OrcScanSuite \
  -Drapids.test.gpu.allocFraction=0.3 \
  -Drapids.test.gpu.maxAllocFraction=0.3 \
  -Drapids.test.gpu.minAllocFraction=0 \
  -s jenkins/settings.xml -P mirror-apache-to-urm

Result:

BUILD SUCCESS
Tests: succeeded 11, failed 0, canceled 0, ignored 1, pending 0

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

JaCoCo sql-plugin line coverage: +16 lines.

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 marked this pull request as ready for review June 17, 2026 08:53
@wjxiz1992 wjxiz1992 requested a review from res-life June 17, 2026 08:53

Copilot AI left a comment

Contributor

Pull request overview

This PR fixes the ORC coalescing reader path in the SQL plugin so it honors Spark’s ignoreMissingFiles setting when a file disappears after planning but before execution, aligning GPU behavior with Spark’s CPU behavior.

Changes:

  • Catch FileNotFoundException in GpuOrcScan coalescing stripe filtering when sqlConf.ignoreMissingFiles is enabled, skipping the missing file instead of failing the scan.
  • Add a regression test that deletes a subset of ORC files after planning and validates CPU/GPU results match when IGNORE_MISSING_FILES is enabled.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/src/test/scala/com/nvidia/spark/rapids/OrcScanSuite.scala | Adds a regression test for ignoreMissingFiles with the ORC coalescing reader. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala | Skips missing ORC files during coalescing filterStripes when ignoreMissingFiles is true. |

Comment on lines +105 to +109
val df = spark.read.format("orc").load(
  firstPath.toString,
  new Path(basePath, "second").toString,
  thirdPath.toString,
  new Path(basePath, "fourth").toString)
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992

Collaborator Author

build

@greptile-apps

greptile-apps Bot commented Jun 24, 2026

Contributor

Greptile Summary

This PR fixes the coalescing ORC reader path in GpuOrcMultiFilePartitionReaderFactory so it correctly honors spark.sql.files.ignoreMissingFiles during the metadata-filtering step (filterStripes). The root cause was that filterStripes reads ORC tail metadata upfront—before the later partition-reader layer that already handles missing files—so a vanished file threw an unguarded FileNotFoundException even when the configuration said to skip it.

  • Wraps the per-file filterStripes call in a try/catch that swallows FileNotFoundException only when ignoreMissingFiles is enabled, logging a warning, mirroring the identical guard already present in MultiFileCloudOrcPartitionReader.
  • Adds a null guard for the OrcPartitionReaderContext returned by filterStripes (empty ORC files return null); previously this would have caused a NullPointerException in the coalescing path.
  • Adds two Scala unit tests: one verifying that missing files are silently skipped and CPU/GPU results agree when ignoreMissingFiles=true, and one verifying that a FileNotFoundException is still propagated when the setting is false.

Confidence Score: 5/5

Safe to merge; the change is a narrow, well-targeted guard that matches the identical pattern already used in the cloud (multi-threaded) ORC reader path.

The fix is a one-method change that catches a specific exception type under a specific configuration flag, reproducing behavior that exists verbatim in the sibling reader class. The accompanying null check removes a latent NPE for empty ORC files in the coalescing path. Both regression tests (skip-when-true, throw-when-false) run successfully. No GPU allocations, no resource ownership changes, and no shim-layer impact.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala | Adds FileNotFoundException guard in buildBaseColumnarReaderForCoalescing + null-check for empty-ORC context; changes map→foreach for correctness; no resource leaks or GPU concerns. |
| tests/src/test/scala/com/nvidia/spark/rapids/OrcScanSuite.scala | Adds two new unit tests covering ignoreMissingFiles=true (skip) and ignoreMissingFiles=false (throw) for the coalescing ORC reader, with CPU/GPU comparison. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Spark as Spark Scheduler
    participant Factory as GpuOrcMultiFilePartitionReaderFactory
    participant Handler as GpuOrcFileFilterHandler
    participant FS as FileSystem

    Spark->>Factory: buildBaseColumnarReaderForCoalescing(files)
    loop For each PartitionedFile
        Factory->>Handler: filterStripes(file, ...)
        Handler->>FS: getOrcTail(filePath) — reads footer
        alt File exists
            FS-->>Handler: ORC tail metadata
            Handler-->>Factory: OrcPartitionReaderContext (or null for empty file)
            alt "context != null"
                Factory->>Factory: append stripes to compressionAndStripes
            end
        else File missing
            FS-->>Handler: FileNotFoundException
            Handler-->>Factory: FileNotFoundException propagated
            alt "ignoreMissingFiles == true"
                Factory->>Factory: logWarning, skip file
            else "ignoreMissingFiles == false"
                Factory-->>Spark: throw FileNotFoundException
            end
        end
    end
    Factory->>Factory: new MultiFileOrcPartitionReader(clippedStripes)
    Factory-->>Spark: PartitionReader[ColumnarBatch]

Reviews (3): Last reviewed commit: "Guard null filterStripes context in ORC ..." | Re-trigger Greptile

readDataSchema,
OrcExtraInfo(orcPartitionReaderContext.requestedMapping)))
} catch {
case e: FileNotFoundException if ignoreMissingFiles =>

Collaborator

NIT:
Do we have a test case for the scenario where a FileNotFoundException occurs and ignoreMissingFiles is false?
Please make sure the external behavior is a FileNotFoundException in this scenario.
I know the behavior is correct, but it's better to have a test case.

Collaborator Author

Good catch — added a test that deletes a planned ORC file with ignoreMissingFiles=false and asserts a FileNotFoundException surfaces (in the failure cause chain) on both CPU and GPU.
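A cause-chain check of the kind mentioned here might look like the following. This is a sketch under assumptions: `hasFileNotFound` is a hypothetical helper name, not the one used in OrcScanSuite, and it assumes the cause chain is acyclic.

```scala
import java.io.FileNotFoundException
import scala.annotation.tailrec

object CauseChainSketch {
  // Returns true if `t` itself, or any exception reachable through getCause,
  // is a FileNotFoundException. A null throwable ends the walk with false.
  @tailrec
  def hasFileNotFound(t: Throwable): Boolean = t match {
    case null => false
    case _: FileNotFoundException => true
    case other => hasFileNotFound(other.getCause)
  }
}
```

Walking the chain matters because a failed Spark job typically surfaces the original error wrapped in one or more outer exceptions, so asserting on the top-level exception type alone would be brittle.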

Complements the existing "honors ignoreMissingFiles" test with the negative
case: when spark.sql.files.ignoreMissingFiles=false and a planned ORC file is
deleted before read, the coalescing reader must surface a FileNotFoundException
(verified via a cause-chain walk) on both CPU and GPU.

Local validation: OrcScanSuite => Tests: succeeded 12, failed 0, canceled 0,
ignored 1, pending 0; BUILD SUCCESS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
try {
val orcPartitionReaderContext = filterHandler.filterStripes(file, dataSchema,
readDataSchema, partitionSchema)
compressionAndStripes.getOrElseUpdate(orcPartitionReaderContext.compressionKind,

@res-life res-life Jun 25, 2026

Collaborator

Could you guard the filterStripes result before dereferencing it? filterStripes can return null for an empty ORC file, and this line currently reads orcPartitionReaderContext.compressionKind before any null check, so that case would throw NullPointerException.

Please wrap all uses of orcPartitionReaderContext in the non-null branch, matching the existing ORC reader paths that handle a null context by producing/skipping empty input.

For example:

if (orcPartitionReaderContext != null) {
  compressionAndStripes.getOrElseUpdate(...)
  ...
}

Collaborator Author

Good catch, guarded it. An empty ORC file makes filterStripes return null, so that file is now skipped instead of dereferencing the context, matching the single-file path that uses EmptyPartitionReader.
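The null guard can be illustrated with a small standalone sketch. Everything here is hypothetical and simplified: `Ctx` stands in for OrcPartitionReaderContext, the local `filterStripes` only imitates the "null for an empty file" behavior described in this thread, and `Option(...)` is one idiomatic alternative to the explicit `!= null` check suggested above.

```scala
import scala.collection.mutable

object NullGuardSketch {
  // Hypothetical, simplified stand-in for the context returned by filterStripes.
  final case class Ctx(compressionKind: String, stripes: Seq[String])

  // Hypothetical filter: returns null for an "empty" file, otherwise one stripe.
  def filterStripes(file: String): Ctx =
    if (file.startsWith("empty")) null else Ctx("ZLIB", Seq(s"$file#0"))

  // Groups stripes by compression kind, skipping files whose context is null.
  def groupStripes(files: Seq[String]): Map[String, Seq[String]] = {
    val byKind = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[String]]
    files.foreach { file =>
      // Option(null) is None, so a null context is never dereferenced.
      Option(filterStripes(file)).foreach { ctx =>
        byKind.getOrElseUpdate(ctx.compressionKind, mutable.ArrayBuffer.empty[String]) ++= ctx.stripes
      }
    }
    byKind.map { case (kind, stripes) => kind -> stripes.toList }.toMap
  }
}
```

Using `foreach` rather than `map` fits this shape: the loop exists for its side effect on the accumulator, and a skipped file contributes nothing.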

@wjxiz1992

Collaborator Author

build

buildBaseColumnarReaderForCoalescing dereferenced
orcPartitionReaderContext.compressionKind without a null check.
filterStripes returns null for an empty ORC file (the resultedColPruneInfo
.isEmpty branch), so an empty file in a coalesced read threw NPE. Wrap the
stripe-collection in a non-null branch and skip the file, matching the
single-file path that returns EmptyPartitionReader for a null context.

Addresses review comment on NVIDIA#15103 (res-life r3472703942).

Validated: mvn package -pl tests -am -Dbuildver=330 \
  -DwildcardSuites=com.nvidia.spark.rapids.OrcScanSuite ->
  Tests: succeeded 12, failed 0, canceled 0, ignored 1, pending 0; BUILD SUCCESS.

### Review notes
- nt-code-review: 0 must-fix. GPU-CPU parity confirmed (null -> skip = zero
  rows = single-file EmptyPartitionReader); coalescing was the only unguarded
  ORC path (cloud path already guards); count(*) unaffected.
- Informational (not addressed): no test exercises the empty-schema-ORC
  coalescing path specifically; existing OrcScanSuite cases cover the
  FileNotFoundException catch path. The guard mirrors the proven single-file
  null handling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992

Collaborator Author

build

Development

Successfully merging this pull request may close these issues.

[BUG][AutoSparkUT]GPUOrcScan doesn't honor the spark.files.ignoreMissingFiles

4 participants