Allow Parquet reader to read incorrectly written (negative) uint8, uint16 values for compatibility #7040

Open
@parthchandra

**Describe the bug**
The Parquet spec says a uint8 or uint16 value must be stored as an int32 annotated with INT(8, false) or INT(16, false). A file with such values is read into an int32 vector, and the value read may be negative. When these values are cast to the unsigned type, the cast method checks whether each value lies within the valid range for that type. Since a negative value is outside the range, the cast method either returns null or throws an error (depending on the specified cast option).
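
To make the range check concrete, here is a minimal std-only sketch. It is not the arrow-rs cast kernel; the function names `strict_u8`, `strict_u16`, `lenient_u8` and `lenient_u16` are hypothetical. The strict versions mimic a range-checked cast that yields null (`None`) for out-of-range input. The lenient versions keep only the low 8 or 16 bits of the stored int32. Treating that as the right way to recover incorrectly written values is an assumption, not something the spec or this issue prescribes.

```rust
/// Range-checked conversion: values outside 0..=255 become `None`,
/// analogous to a cast that returns null on failure.
fn strict_u8(raw: i32) -> Option<u8> {
    u8::try_from(raw).ok()
}

/// Range-checked conversion for 16-bit unsigned values.
fn strict_u16(raw: i32) -> Option<u16> {
    u16::try_from(raw).ok()
}

/// Lenient conversion (assumption): keep the low 8 bits of the stored
/// int32, so a sign-extended value maps back to an unsigned one.
fn lenient_u8(raw: i32) -> u8 {
    raw as u8
}

/// Lenient conversion (assumption): keep the low 16 bits.
fn lenient_u16(raw: i32) -> u16 {
    raw as u16
}
```

With the values from the attached file, `strict_u8(-18)` and `strict_u16(-10002)` both give `None`, which matches the empty cells in the results, while `lenient_u8(-18)` gives 238 and `lenient_u16(-10002)` gives 55534.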

**To Reproduce**
I modified parquet/examples/read_parquet.rs to read columns _9 and _10 from the attached file.

The file schema and contents, as dumped by the parquet CLI -

Schema

File path:  ./alltypes_extended_plain.parquet
Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
Properties:
  writer.model.name: example
Schema:
message root {
  optional boolean _1;
  optional int32 _2 (INTEGER(8,true));
  optional int32 _3 (INTEGER(16,true));
  optional int32 _4;
  optional int64 _5;
  optional float _6;
  optional double _7;
  optional binary _8 (STRING);
  optional int32 _9 (INTEGER(8,false));
  optional int32 _10 (INTEGER(16,false));
  optional int32 _11 (INTEGER(32,false));
  optional int64 _12 (INTEGER(64,false));
  optional binary _13 (ENUM);
  optional fixed_len_byte_array(3) _14;
  optional int32 _15 (DECIMAL(5,2));
  optional int64 _16 (DECIMAL(18,10));
  optional fixed_len_byte_array(16) _17 (DECIMAL(38,37));
  optional int64 _18 (TIMESTAMP(MILLIS,true));
  optional int64 _19 (TIMESTAMP(MICROS,true));
  optional int32 _20 (DATE);
}

Values -

{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 18, "_3": 10002, "_4": 10002, "_5": 10002, "_6": 10002.0, "_7": 10002.0, "_8": "100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002", "_9": -18, "_10": -10002, "_11": -10002, "_12": -10002, "_13": "10002", "_14": [50, 50, 50], "_15": 10002, "_16": 10002, "_17": [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50], "_18": 10002, "_19": 10002, "_20": 10002}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 20, "_3": 10004, "_4": 10004, "_5": 10004, "_6": 10004.0, "_7": 10004.0, "_8": "100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004", "_9": -20, "_10": -10004, "_11": -10004, "_12": -10004, "_13": "10004", "_14": [52, 52, 52], "_15": 10004, "_16": 10004, "_17": [52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52], "_18": 10004, "_19": 10004, "_20": 10004}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 24, "_3": 10008, "_4": 10008, "_5": 10008, "_6": 10008.0, "_7": 10008.0, "_8": "100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008", "_9": -24, "_10": -10008, "_11": -10008, "_12": -10008, "_13": "10008", "_14": [56, 56, 56], "_15": 10008, "_16": 10008, "_17": [56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56], "_18": 10008, "_19": 10008, "_20": 10008}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}

Results

+----+-----+
| _9 | _10 |
+----+-----+
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
+----+-----+

**Expected behavior**
The reader should return non-null values for _9 and _10 in the rows that hold data.

**Additional context**
Parquet file generated by Spark:  

[alltypes_extended_plain.parquet.zip](https://github.com/user-attachments/files/18577979/alltypes_extended_plain.parquet.zip)
