**Describe the bug**
The Parquet spec says a uint8 or uint16 value must be stored as an int32 annotated by `INT(8, false)` or `INT(16, false)`, respectively. A file with such values gets read into an int32 vector, and the values read may be negative. When these values are cast to the unsigned type, the cast method checks whether each value is within the valid range for that type. Since a negative value is outside the range, the cast method either returns null or throws an error, depending on the specified cast option.
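The difference between a range-checked cast and a bit-reinterpreting cast can be sketched in plain Rust, using only the standard library. This is an illustration of the mechanism, not the arrow-rs cast kernel; the helper names `checked_u8`, `reinterpret_u8`, and `reinterpret_u16` are hypothetical, and the input values are the ones stored in columns `_9` and `_10` of the attached file.

```rust
/// Range-checked conversion: returns None for anything outside 0..=255,
/// which is analogous to a checked cast producing null.
fn checked_u8(v: i32) -> Option<u8> {
    u8::try_from(v).ok()
}

/// Reinterpreting conversion: keeps only the low 8 bits of the int32.
fn reinterpret_u8(v: i32) -> u8 {
    v as u8
}

/// Reinterpreting conversion: keeps only the low 16 bits of the int32.
fn reinterpret_u16(v: i32) -> u16 {
    v as u16
}

fn main() {
    // Values stored in column _9 (annotated as an unsigned 8-bit integer).
    for v in [-18i32, -20, -24] {
        println!(
            "{v}: checked = {:?}, reinterpreted = {}",
            checked_u8(v),
            reinterpret_u8(v)
        );
    }
    // Values stored in column _10 (annotated as an unsigned 16-bit integer).
    for v in [-10002i32, -10004, -10008] {
        println!("{v}: reinterpreted = {}", reinterpret_u16(v));
    }
}
```

With a checked conversion every negative input is rejected, which matches the all-null output reported in this issue; truncating to the low bits yields a value in the unsigned range instead.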
**To Reproduce**
I modified `parquet/examples/read_parquet.rs` to read columns `_9` and `_10` from the attached file.
The file schema and contents, as dumped by the parquet CLI:
Schema
File path: ./alltypes_extended_plain.parquet
Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
Properties:
writer.model.name: example
Schema:
message root {
optional boolean _1;
optional int32 _2 (INTEGER(8,true));
optional int32 _3 (INTEGER(16,true));
optional int32 _4;
optional int64 _5;
optional float _6;
optional double _7;
optional binary _8 (STRING);
optional int32 _9 (INTEGER(8,false));
optional int32 _10 (INTEGER(16,false));
optional int32 _11 (INTEGER(32,false));
optional int64 _12 (INTEGER(64,false));
optional binary _13 (ENUM);
optional fixed_len_byte_array(3) _14;
optional int32 _15 (DECIMAL(5,2));
optional int64 _16 (DECIMAL(18,10));
optional fixed_len_byte_array(16) _17 (DECIMAL(38,37));
optional int64 _18 (TIMESTAMP(MILLIS,true));
optional int64 _19 (TIMESTAMP(MICROS,true));
optional int32 _20 (DATE);
}
Values
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 18, "_3": 10002, "_4": 10002, "_5": 10002, "_6": 10002.0, "_7": 10002.0, "_8": "100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002", "_9": -18, "_10": -10002, "_11": -10002, "_12": -10002, "_13": "10002", "_14": [50, 50, 50], "_15": 10002, "_16": 10002, "_17": [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50], "_18": 10002, "_19": 10002, "_20": 10002}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 20, "_3": 10004, "_4": 10004, "_5": 10004, "_6": 10004.0, "_7": 10004.0, "_8": "100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004", "_9": -20, "_10": -10004, "_11": -10004, "_12": -10004, "_13": "10004", "_14": [52, 52, 52], "_15": 10004, "_16": 10004, "_17": [52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52], "_18": 10004, "_19": 10004, "_20": 10004}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 24, "_3": 10008, "_4": 10008, "_5": 10008, "_6": 10008.0, "_7": 10008.0, "_8": "100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008", "_9": -24, "_10": -10008, "_11": -10008, "_12": -10008, "_13": "10008", "_14": [56, 56, 56], "_15": 10008, "_16": 10008, "_17": [56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56], "_18": 10008, "_19": 10008, "_20": 10008}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
Results
+----+-----+
| _9 | _10 |
+----+-----+
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
+----+-----+
**Expected behavior**
Columns `_9` and `_10` should return non-null values for the rows that are non-null in the file.
**Additional context**
Parquet file generated by Spark:
[alltypes_extended_plain.parquet.zip](https://github.com/user-attachments/files/18577979/alltypes_extended_plain.parquet.zip)