[AutoSparkUT] Fix ORC coalescing ignoreMissingFiles #15103
Signed-off-by: Allen Xu <allxu@nvidia.com>
Pull request overview
This PR fixes the ORC coalescing reader path in the SQL plugin so it honors Spark’s ignoreMissingFiles setting when a file disappears after planning but before execution, aligning GPU behavior with Spark’s CPU behavior.
Changes:
- Catch FileNotFoundException in GpuOrcScan coalescing stripe filtering when sqlConf.ignoreMissingFiles is enabled, skipping the missing file instead of failing the scan.
- Add a regression test that deletes a subset of ORC files after planning and validates CPU/GPU results match when IGNORE_MISSING_FILES is enabled.
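
The catch-and-skip behavior described in the first bullet can be sketched in isolation. This is a minimal illustration, not the plugin's code: `MissingFileSketch`, `readFooterSize`, and `collectSizes` are hypothetical names, and the file-size read merely stands in for the ORC tail-metadata read done by filterStripes.

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Path}

object MissingFileSketch {
  // Hypothetical stand-in for a per-file metadata read (not the plugin's filterStripes).
  // Throws FileNotFoundException when the file vanished after planning.
  def readFooterSize(file: Path): Long = {
    if (!Files.exists(file)) throw new FileNotFoundException(file.toString)
    Files.size(file)
  }

  // Collect metadata for every planned file. A missing file is skipped only
  // when ignoreMissingFiles is true; otherwise the exception propagates.
  def collectSizes(files: Seq[Path], ignoreMissingFiles: Boolean): Seq[Long] =
    files.flatMap { file =>
      try {
        Some(readFooterSize(file))
      } catch {
        // The guard keeps the default (false) behavior unchanged.
        case _: FileNotFoundException if ignoreMissingFiles => None
      }
    }
}
```

Putting the flag in the pattern guard, rather than catching unconditionally and rethrowing, means the exception is never intercepted when the setting is disabled.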
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/src/test/scala/com/nvidia/spark/rapids/OrcScanSuite.scala | Adds a regression test for ignoreMissingFiles with the ORC coalescing reader. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala | Skips missing ORC files during coalescing filterStripes when ignoreMissingFiles is true. |
val df = spark.read.format("orc").load(
  firstPath.toString,
  new Path(basePath, "second").toString,
  thirdPath.toString,
  new Path(basePath, "fourth").toString)
build
    readDataSchema,
    OrcExtraInfo(orcPartitionReaderContext.requestedMapping)))
} catch {
  case e: FileNotFoundException if ignoreMissingFiles =>
NIT:
Do we have a test case for the scenario where a FileNotFoundException occurs and ignoreMissingFiles is false?
We should make sure the external behavior in this scenario is a FileNotFoundException.
I know the behavior is correct, but it's better to have a test case.
Good catch. Added a test that deletes a planned ORC file with ignoreMissingFiles=false and asserts that a FileNotFoundException surfaces (in the failure cause chain) on both CPU and GPU.
Complements the existing "honors ignoreMissingFiles" test with the negative case: when spark.sql.files.ignoreMissingFiles=false and a planned ORC file is deleted before read, the coalescing reader must surface a FileNotFoundException (verified via a cause-chain walk) on both CPU and GPU.

Local validation: OrcScanSuite => Tests: succeeded 12, failed 0, canceled 0, ignored 1, pending 0; BUILD SUCCESS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
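
The "cause-chain walk" mentioned above is needed because a failure inside a task typically reaches the caller wrapped in other exceptions. A minimal sketch of such a walk follows. `CauseChainSketch` and `hasCause` are hypothetical names chosen for illustration and are not taken from the PR's test code.

```scala
import java.io.FileNotFoundException
import scala.annotation.tailrec

object CauseChainSketch {
  // Walk t, t.getCause, t.getCause.getCause, ... and report whether any
  // link is an instance of cls. Stops when the chain ends (null cause).
  @tailrec
  def hasCause[T <: Throwable](t: Throwable, cls: Class[T]): Boolean =
    if (t == null) false
    else if (cls.isInstance(t)) true
    else hasCause(t.getCause, cls)
}
```

A test could then assert `CauseChainSketch.hasCause(thrown, classOf[FileNotFoundException])` on the exception captured from the failed read, without depending on how many wrapper layers the CPU or GPU path adds.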
try {
  val orcPartitionReaderContext = filterHandler.filterStripes(file, dataSchema,
    readDataSchema, partitionSchema)
  compressionAndStripes.getOrElseUpdate(orcPartitionReaderContext.compressionKind,
Could you guard the filterStripes result before dereferencing it? filterStripes can return null for an empty ORC file, and this line currently reads orcPartitionReaderContext.compressionKind before any null check, so that case would throw NullPointerException.
Please wrap all uses of orcPartitionReaderContext in the non-null branch, matching the existing ORC reader paths that handle a null context by producing/skipping empty input.
For example:
if (orcPartitionReaderContext != null) {
  compressionAndStripes.getOrElseUpdate(...)
  ...
}
Good catch. Guarded it. An empty ORC file makes filterStripes return null, so that file is now skipped instead of dereferencing the context, matching the single-file path that uses EmptyPartitionReader.
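
As an illustration of the guard discussed in this thread, the sketch below wraps a nullable result in `Option` so that every use of the context sits inside the non-null branch. All names here (`NullGuardSketch`, `ReaderContext`, this simplified `filterStripes`, `groupByCompression`) are hypothetical stand-ins and far simpler than the plugin's actual types. The PR itself may use a plain `!= null` check as suggested in the review.

```scala
import scala.collection.mutable

object NullGuardSketch {
  // Hypothetical, simplified context; the real plugin type carries more fields.
  final case class ReaderContext(compressionKind: String, stripes: Seq[Int])

  // Stand-in for an API that returns null for an empty file.
  def filterStripes(stripes: Seq[Int]): ReaderContext =
    if (stripes.isEmpty) null else ReaderContext("ZLIB", stripes)

  // Group stripes by compression kind, skipping files whose context is null.
  def groupByCompression(files: Seq[Seq[Int]]): Map[String, Seq[Int]] = {
    val acc = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[Int]]
    files.foreach { f =>
      // Option(null) is None, so the body never dereferences a null context.
      Option(filterStripes(f)).foreach { ctx =>
        acc.getOrElseUpdate(ctx.compressionKind, mutable.ArrayBuffer.empty[Int]) ++= ctx.stripes
      }
    }
    acc.map { case (kind, stripes) => kind -> stripes.toList }.toMap
  }
}
```

Skipping the empty file contributes zero rows, which is the same observable result as returning an empty reader for it.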
build
buildBaseColumnarReaderForCoalescing dereferenced orcPartitionReaderContext.compressionKind without a null check. filterStripes returns null for an empty ORC file (the resultedColPruneInfo.isEmpty branch), so an empty file in a coalesced read threw NPE. Wrap the stripe-collection in a non-null branch and skip the file, matching the single-file path that returns EmptyPartitionReader for a null context.

Addresses review comment on NVIDIA#15103 (res-life r3472703942).

Validated:
mvn package -pl tests -am -Dbuildver=330 \
  -DwildcardSuites=com.nvidia.spark.rapids.OrcScanSuite
-> Tests: succeeded 12, failed 0, canceled 0, ignored 1, pending 0; BUILD SUCCESS.

### Review notes
- nt-code-review: 0 must-fix. GPU-CPU parity confirmed (null -> skip = zero rows = single-file EmptyPartitionReader); coalescing was the only unguarded ORC path (cloud path already guards); count(*) unaffected.
- Informational (not addressed): no test exercises the empty-schema-ORC coalescing path specifically; existing OrcScanSuite cases cover the FileNotFoundException catch path. The guard mirrors the proven single-file null handling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
build
Closes #15100.
This fixes the ORC coalescing reader path so it honors spark.files.ignoreMissingFiles / spark.sql.files.ignoreMissingFiles during the metadata filtering step.

Root cause: the coalescing ORC reader reads ORC tail metadata in filterStripes before the later partition reader layer can apply the existing missing-file handling. If a planned ORC file disappears after planning, filterStripes throws FileNotFoundException and the GPU path fails even when Spark is configured to ignore missing files.

Changes:
- Catch FileNotFoundException during ORC coalescing stripe filtering only when ignoreMissingFiles is enabled.

Validation:
Result:
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance
JaCoCo sql-plugin line coverage: +16 lines.