[AutoSparkUT] Fix ORC coalescing ignoreMissingFiles by wjxiz1992 · Pull Request #15103 · NVIDIA/cudf-spark

wjxiz1992 · 2026-06-17T08:40:15Z

This fixes the ORC coalescing reader path so it honors spark.files.ignoreMissingFiles / spark.sql.files.ignoreMissingFiles during the metadata filtering step.

Root cause: the coalescing ORC reader reads ORC tail metadata in filterStripes before the later partition reader layer can apply the existing missing-file handling. If a planned ORC file disappears after planning, filterStripes throws FileNotFoundException and the GPU path fails even when Spark is configured to ignore missing files.

Changes:

Catch FileNotFoundException during ORC coalescing stripe filtering only when ignoreMissingFiles is enabled.
Keep all other ORC/schema/corruption errors unchanged.
Add a regression test that plans an ORC scan, deletes a subset of planned files, forces the coalescing reader, and compares CPU/GPU results.
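
For reviewers who want the shape of the guard without opening the diff, here is a minimal sketch of the pattern described above. It is not the plugin code: `MissingFileGuard`, `readTail`, and `collectTails` are hypothetical names, and a plain `Map` stands in for the file system, so the only thing it illustrates is the flag-conditioned catch of `FileNotFoundException`.

```scala
import java.io.FileNotFoundException

object MissingFileGuard {
  // Hypothetical stand-in for the footer read done during stripe filtering:
  // returns a stripe count for a path, or throws if the file is gone.
  def readTail(existing: Map[String, Int], path: String): Int =
    existing.getOrElse(path, throw new FileNotFoundException(path))

  // Collect metadata for every planned file. A missing file is skipped only
  // when ignoreMissingFiles is true; every other exception still propagates.
  def collectTails(
      planned: Seq[String],
      existing: Map[String, Int],
      ignoreMissingFiles: Boolean): Seq[(String, Int)] =
    planned.flatMap { path =>
      try {
        Some(path -> readTail(existing, path))
      } catch {
        case e: FileNotFoundException if ignoreMissingFiles =>
          Console.err.println(s"Skipped missing file: $path (${e.getMessage})")
          None
      }
    }
}
```

Because the pattern guard sits on the `case` itself, the exception is rethrown unchanged when the flag is off, so no separate rethrow branch is needed.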

Validation:

mvn package -pl tests -am -Dbuildver=330 \
  -Dmaven.repo.local=./.mvn-repo \
  -DwildcardSuites=com.nvidia.spark.rapids.OrcScanSuite \
  -Drapids.test.gpu.allocFraction=0.3 \
  -Drapids.test.gpu.maxAllocFraction=0.3 \
  -Drapids.test.gpu.minAllocFraction=0 \
  -s jenkins/settings.xml -P mirror-apache-to-urm

Result:

BUILD SUCCESS
Tests: succeeded 11, failed 0, canceled 0, ignored 1, pending 0

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
(Please provide the names of the existing tests in the PR description.)
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

JaCoCo sql-plugin line coverage: +16 lines.

Signed-off-by: Allen Xu <allxu@nvidia.com>

Copilot

Pull request overview

This PR fixes the ORC coalescing reader path in the SQL plugin so it honors Spark’s ignoreMissingFiles setting when a file disappears after planning but before execution, aligning GPU behavior with Spark’s CPU behavior.

Changes:

Catch FileNotFoundException in GpuOrcScan coalescing stripe filtering when sqlConf.ignoreMissingFiles is enabled, skipping the missing file instead of failing the scan.
Add a regression test that deletes a subset of ORC files after planning and validates CPU/GPU results match when IGNORE_MISSING_FILES is enabled.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
`tests/src/test/scala/com/nvidia/spark/rapids/OrcScanSuite.scala`	Adds a regression test for `ignoreMissingFiles` with the ORC coalescing reader.
`sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala`	Skips missing ORC files during coalescing `filterStripes` when `ignoreMissingFiles` is true.

+        val df = spark.read.format("orc").load(
+          firstPath.toString,
+          new Path(basePath, "second").toString,
+          thirdPath.toString,
+          new Path(basePath, "fourth").toString)


Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 · 2026-06-17T09:14:48Z

build

greptile-apps · 2026-06-24T01:50:58Z

Greptile Summary

This PR fixes the coalescing ORC reader path in GpuOrcMultiFilePartitionReaderFactory so it correctly honors spark.sql.files.ignoreMissingFiles during the metadata-filtering step (filterStripes). The root cause was that filterStripes reads ORC tail metadata upfront—before the later partition-reader layer that already handles missing files—so a vanished file threw an unguarded FileNotFoundException even when the configuration said to skip it.

Wraps the per-file filterStripes call in a try/catch that swallows FileNotFoundException only when ignoreMissingFiles is enabled, logging a warning, mirroring the identical guard already present in MultiFileCloudOrcPartitionReader.
Adds a null guard for the OrcPartitionReaderContext returned by filterStripes (empty ORC files return null); previously this would have caused a NullPointerException in the coalescing path.
Adds two Scala unit tests: one verifying that missing files are silently skipped and CPU/GPU results agree when ignoreMissingFiles=true, and one verifying that a FileNotFoundException is still propagated when the setting is false.

Confidence Score: 5/5

Safe to merge; the change is a narrow, well-targeted guard that matches the identical pattern already used in the cloud (multi-threaded) ORC reader path.

The fix is a one-method change that catches a specific exception type under a specific configuration flag, reproducing behavior that exists verbatim in the sibling reader class. The accompanying null check removes a latent NPE for empty ORC files in the coalescing path. Both regression tests (skip-when-true, throw-when-false) run successfully. No GPU allocations, no resource ownership changes, and no shim-layer impact.

No files require special attention.

Important Files Changed

Filename	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala	Adds FileNotFoundException guard in buildBaseColumnarReaderForCoalescing + null-check for empty-ORC context; changes map→foreach for correctness; no resource leaks or GPU concerns.
tests/src/test/scala/com/nvidia/spark/rapids/OrcScanSuite.scala	Adds two new unit tests covering ignoreMissingFiles=true (skip) and ignoreMissingFiles=false (throw) for the coalescing ORC reader, with CPU/GPU comparison.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Spark as Spark Scheduler
    participant Factory as GpuOrcMultiFilePartitionReaderFactory
    participant Handler as GpuOrcFileFilterHandler
    participant FS as FileSystem

    Spark->>Factory: buildBaseColumnarReaderForCoalescing(files)
    loop For each PartitionedFile
        Factory->>Handler: filterStripes(file, ...)
        Handler->>FS: getOrcTail(filePath) — reads footer
        alt File exists
            FS-->>Handler: ORC tail metadata
            Handler-->>Factory: OrcPartitionReaderContext (or null for empty file)
            alt "context != null"
                Factory->>Factory: append stripes to compressionAndStripes
            end
        else File missing
            FS-->>Handler: FileNotFoundException
            Handler-->>Factory: FileNotFoundException propagated
            alt "ignoreMissingFiles == true"
                Factory->>Factory: logWarning, skip file
            else "ignoreMissingFiles == false"
                Factory-->>Spark: throw FileNotFoundException
            end
        end
    end
    Factory->>Factory: new MultiFileOrcPartitionReader(clippedStripes)
    Factory-->>Spark: PartitionReader[ColumnarBatch]

Reviews (3): Last reviewed commit: "Guard null filterStripes context in ORC ..."

res-life · 2026-06-25T01:44:52Z

+                  readDataSchema,
+                  OrcExtraInfo(orcPartitionReaderContext.requestedMapping)))
+          } catch {
+            case e: FileNotFoundException if ignoreMissingFiles =>


NIT:
Do we have a test case for this scenario: a FileNotFoundException is thrown while ignoreMissingFiles is false?
It should confirm that the external behavior is a FileNotFoundException.
I know the behavior is correct, but it would be better to have a test case.

Good catch. Added a test that deletes a planned ORC file with ignoreMissingFiles=false and asserts that a FileNotFoundException surfaces (in the failure cause chain) on both CPU and GPU.

Complements the existing "honors ignoreMissingFiles" test with the negative case: when spark.sql.files.ignoreMissingFiles=false and a planned ORC file is deleted before read, the coalescing reader must surface a FileNotFoundException (verified via a cause-chain walk) on both CPU and GPU.

Local validation: OrcScanSuite => Tests: succeeded 12, failed 0, canceled 0, ignored 1, pending 0; BUILD SUCCESS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
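
The "cause-chain walk" mentioned in this commit message could look something like the sketch below. The helper names `CauseChain`, `causes`, and `hasFileNotFound` are hypothetical and not taken from OrcScanSuite, and the depth bound is an added assumption to keep the walk finite if a cause chain were ever cyclic.

```scala
import java.io.FileNotFoundException

object CauseChain {
  // Follow getCause links from the outermost exception inward.
  // maxDepth is a defensive bound, not something the PR specifies.
  def causes(t: Throwable, maxDepth: Int = 32): List[Throwable] =
    Iterator.iterate(t)(_.getCause).takeWhile(_ != null).take(maxDepth).toList

  // True if any exception in the chain is a FileNotFoundException.
  def hasFileNotFound(t: Throwable): Boolean =
    causes(t).exists(_.isInstanceOf[FileNotFoundException])
}
```

A walk like this is useful in such a test because the exception is typically wrapped by the time it reaches the caller, so asserting on the outermost type alone would be brittle.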

res-life · 2026-06-25T07:44:07Z

+          try {
+            val orcPartitionReaderContext = filterHandler.filterStripes(file, dataSchema,
+              readDataSchema, partitionSchema)
+            compressionAndStripes.getOrElseUpdate(orcPartitionReaderContext.compressionKind,


Could you guard the filterStripes result before dereferencing it? filterStripes can return null for an empty ORC file, and this line currently reads orcPartitionReaderContext.compressionKind before any null check, so that case would throw NullPointerException.

Please wrap all uses of orcPartitionReaderContext in the non-null branch, matching the existing ORC reader paths that handle a null context by producing/skipping empty input.

For example:

if (orcPartitionReaderContext != null) { compressionAndStripes.getOrElseUpdate(...) ... }

Good catch, guarded it. An empty ORC file makes filterStripes return null, so that file is now skipped instead of dereferencing the context, matching the single-file path that uses EmptyPartitionReader.

wjxiz1992 · 2026-06-25T09:16:23Z

build

buildBaseColumnarReaderForCoalescing dereferenced orcPartitionReaderContext.compressionKind without a null check. filterStripes returns null for an empty ORC file (the resultedColPruneInfo.isEmpty branch), so an empty file in a coalesced read threw NPE. Wrap the stripe-collection in a non-null branch and skip the file, matching the single-file path that returns EmptyPartitionReader for a null context.

Addresses review comment on NVIDIA#15103 (res-life r3472703942).

Validated:

mvn package -pl tests -am -Dbuildver=330 \
  -DwildcardSuites=com.nvidia.spark.rapids.OrcScanSuite

-> Tests: succeeded 12, failed 0, canceled 0, ignored 1, pending 0; BUILD SUCCESS.

### Review notes

- nt-code-review: 0 must-fix. GPU-CPU parity confirmed (null -> skip = zero rows = single-file EmptyPartitionReader); coalescing was the only unguarded ORC path (cloud path already guards); count(*) unaffected.
- Informational (not addressed): no test exercises the empty-schema-ORC coalescing path specifically; existing OrcScanSuite cases cover the FileNotFoundException catch path. The guard mirrors the proven single-file null handling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 · 2026-06-26T06:10:29Z

build

Fix ORC coalescing ignore missing files

1a09e07

Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 marked this pull request as ready for review June 17, 2026 08:53

wjxiz1992 requested review from binmahone, Copilot and liurenjie1024 June 17, 2026 08:53

Copilot started reviewing on behalf of wjxiz1992 June 17, 2026 08:53

wjxiz1992 requested a review from res-life June 17, 2026 08:53

Copilot AI reviewed Jun 17, 2026

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/OrcScanSuite.scala

Comment on lines +105 to +109

val df = spark.read.format("orc").load(
  firstPath.toString,
  new Path(basePath, "second").toString,
  thirdPath.toString,
  new Path(basePath, "fourth").toString)

Address ORC scan review feedback

007ccea

Signed-off-by: Allen Xu <allxu@nvidia.com>

res-life reviewed Jun 25, 2026

wjxiz1992 wants to merge 4 commits into NVIDIA:main from wjxiz1992:fix/15100-orc-ignore-missing

4 participants
