refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238

andygrove · 2026-01-22T01:54:10Z

Summary

Renames spark.comet.scan.allowIncompatible to spark.comet.scan.unsignedSmallIntSafetyCheck
Changes default from false to true (safety check enabled by default)
Removes ByteType from the safety check, leaving only ShortType subject to fallback

Why ByteType is Safe

ByteType columns are always safe for native execution because:

Parquet type mapping: Spark's ByteType can only originate from signed INT8 in Parquet. There is no unsigned 8-bit Parquet type (UINT_8) that maps to ByteType.
UINT_8 maps to ShortType: When Parquet files contain unsigned UINT_8 columns, Spark maps them to ShortType (16-bit), not ByteType. This is because UINT_8 values (0-255) exceed the signed byte range (-128 to 127).
Truncation preserves signed values: When storing signed INT8 in 8 bits, the truncation from any wider representation preserves the correct signed value due to two's complement representation.

Why ShortType Needs the Safety Check

ShortType columns may be problematic because:

Ambiguous origin: ShortType can come from either signed INT16 (safe) or unsigned UINT_8 (potentially incompatible).
Different reader behavior: Arrow-based readers like DataFusion respect the unsigned UINT_8 logical type and read data as unsigned, while Spark ignores the logical type and reads as signed. This can produce different results for values 128-255.
No metadata available: At scan time, Comet cannot determine whether a ShortType column originated from INT16 or UINT_8, so the safety check conservatively falls back to Spark for all ShortType columns.

Users who know their data does not contain unsigned UINT_8 columns can disable the safety check with spark.comet.scan.unsignedSmallIntSafetyCheck=false.

🤖 Generated with Claude Code

…yCheck This change renames `spark.comet.scan.allowIncompatible` to `spark.comet.scan.unsignedSmallIntSafetyCheck` and changes its default to `true` (enabled by default). The key change is that ByteType is removed from the safety check entirely, leaving only ShortType subject to fallback behavior. ## Why ByteType is Safe ByteType columns are always safe for native execution because: 1. **Parquet type mapping**: Spark's ByteType can only originate from signed INT8 in Parquet. There is no unsigned 8-bit Parquet type (UINT_8) that maps to ByteType. 2. **UINT_8 maps to ShortType**: When Parquet files contain unsigned UINT_8 columns, Spark maps them to ShortType (16-bit), not ByteType. This is because UINT_8 values (0-255) exceed the signed byte range (-128 to 127). 3. **Truncation preserves signed values**: When storing signed INT8 in 8 bits, the truncation from any wider representation preserves the correct signed value due to two's complement representation. ## Why ShortType Needs the Safety Check ShortType columns may be problematic because: 1. **Ambiguous origin**: ShortType can come from either signed INT16 (safe) or unsigned UINT_8 (potentially incompatible). 2. **Different reader behavior**: Arrow-based readers like DataFusion respect the unsigned UINT_8 logical type and read data as unsigned, while Spark ignores the logical type and reads as signed. This can produce different results for values 128-255. 3. **No metadata available**: At scan time, Comet cannot determine whether a ShortType column originated from INT16 or UINT_8, so the safety check conservatively falls back to Spark for all ShortType columns. Users who know their data does not contain unsigned UINT_8 columns can disable the safety check with `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Use local `root_op` variable instead of unwrapping `exec_context.root_op` - Replace `is_some()` + `unwrap()` pattern with `if let Some(...)` Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

codecov-commenter · 2026-01-24T12:57:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.07%. Comparing base (f09f8af) to head (66e3186).
⚠️ Report is 878 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3238      +/-   ##
============================================
+ Coverage     56.12%   60.07%   +3.95%     
- Complexity      976     1437     +461     
============================================
  Files           119      172      +53     
  Lines         11743    15929    +4186     
  Branches       2251     2631     +380     
============================================
+ Hits           6591     9570    +2979     
- Misses         4012     5031    +1019     
- Partials       1140     1328     +188

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

comphead

Thanks @andygrove it is LGTM

andygrove marked this pull request as ready for review January 22, 2026 01:55

andygrove marked this pull request as draft January 22, 2026 02:18

andygrove and others added 5 commits January 22, 2026 07:34

format

8185dae

rename

b0305d2

rename

0bae43c

Fix clippy warnings for Rust 1.93

537d62e

- Use local `root_op` variable instead of unwrapping `exec_context.root_op` - Replace `is_some()` + `unwrap()` pattern with `if let Some(...)` Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Merge branch 'fix-clippy-rust-1.93' into unsigned-small-int-safety-check

4f6b2a8

andygrove modified the milestone: 0.13.0 Jan 22, 2026

andygrove added 2 commits January 23, 2026 11:29

upmerge

2c68bbb

upmerge

66e3186

andygrove marked this pull request as ready for review January 24, 2026 15:54

comphead approved these changes Jan 25, 2026

View reviewed changes

comphead merged commit 1a000ab into apache:main Jan 25, 2026
130 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238

refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238

andygrove commented Jan 22, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jan 24, 2026 •

edited

Loading

Uh oh!

comphead left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238

refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238

Conversation

andygrove commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why ByteType is Safe

Why ShortType Needs the Safety Check

Uh oh!

codecov-commenter commented Jan 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

comphead left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

andygrove commented Jan 22, 2026 •

edited

Loading

codecov-commenter commented Jan 24, 2026 •

edited

Loading