-
Notifications
You must be signed in to change notification settings - Fork 277
refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238
Conversation
…yCheck This change renames `spark.comet.scan.allowIncompatible` to `spark.comet.scan.unsignedSmallIntSafetyCheck` and changes its default to `true` (enabled by default). The key change is that ByteType is removed from the safety check entirely, leaving only ShortType subject to fallback behavior. ## Why ByteType is Safe ByteType columns are always safe for native execution because: 1. **Parquet type mapping**: Spark's ByteType can only originate from signed INT8 in Parquet. There is no unsigned 8-bit Parquet type (UINT_8) that maps to ByteType. 2. **UINT_8 maps to ShortType**: When Parquet files contain unsigned UINT_8 columns, Spark maps them to ShortType (16-bit), not ByteType. This is because UINT_8 values (0-255) exceed the signed byte range (-128 to 127). 3. **Truncation preserves signed values**: When storing signed INT8 in 8 bits, the truncation from any wider representation preserves the correct signed value due to two's complement representation. ## Why ShortType Needs the Safety Check ShortType columns may be problematic because: 1. **Ambiguous origin**: ShortType can come from either signed INT16 (safe) or unsigned UINT_8 (potentially incompatible). 2. **Different reader behavior**: Arrow-based readers like DataFusion respect the unsigned UINT_8 logical type and read data as unsigned, while Spark ignores the logical type and reads as signed. This can produce different results for values 128-255. 3. **No metadata available**: At scan time, Comet cannot determine whether a ShortType column originated from INT16 or UINT_8, so the safety check conservatively falls back to Spark for all ShortType columns. Users who know their data does not contain unsigned UINT_8 columns can disable the safety check with `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use local `root_op` variable instead of unwrapping `exec_context.root_op` - Replace `is_some()` + `unwrap()` pattern with `if let Some(...)` Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3238 +/- ##
============================================
+ Coverage 56.12% 60.07% +3.95%
- Complexity 976 1437 +461
============================================
Files 119 172 +53
Lines 11743 15929 +4186
Branches 2251 2631 +380
============================================
+ Hits 6591 9570 +2979
- Misses 4012 5031 +1019
- Partials 1140 1328 +188 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove it is LGTM
Summary
spark.comet.scan.allowIncompatibletospark.comet.scan.unsignedSmallIntSafetyCheckfalsetotrue(safety check enabled by default)ByteTypefrom the safety check, leaving onlyShortTypesubject to fallbackWhy ByteType is Safe
ByteTypecolumns are always safe for native execution because:Parquet type mapping: Spark's
ByteTypecan only originate from signedINT8in Parquet. There is no unsigned 8-bit Parquet type (UINT_8) that maps toByteType.UINT_8 maps to ShortType: When Parquet files contain unsigned
UINT_8columns, Spark maps them toShortType(16-bit), notByteType. This is becauseUINT_8values (0-255) exceed the signed byte range (-128 to 127).Truncation preserves signed values: When storing signed
INT8in 8 bits, the truncation from any wider representation preserves the correct signed value due to two's complement representation.Why ShortType Needs the Safety Check
ShortTypecolumns may be problematic because:Ambiguous origin:
ShortTypecan come from either signedINT16(safe) or unsignedUINT_8(potentially incompatible).Different reader behavior: Arrow-based readers like DataFusion respect the unsigned
UINT_8logical type and read data as unsigned, while Spark ignores the logical type and reads as signed. This can produce different results for values 128-255.No metadata available: At scan time, Comet cannot determine whether a
ShortTypecolumn originated fromINT16orUINT_8, so the safety check conservatively falls back to Spark for allShortTypecolumns.Users who know their data does not contain unsigned
UINT_8columns can disable the safety check withspark.comet.scan.unsignedSmallIntSafetyCheck=false.🤖 Generated with Claude Code