Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion parquet/src/basic.rs
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ pub use crate::compression::{BrotliLevel, GzipLevel, ZstdLevel};
use crate::file::metadata::HeapSize;
use crate::parquet_thrift::{
ElementType, FieldType, ReadThrift, ThriftCompactInputProtocol, ThriftCompactOutputProtocol,
WriteThrift, WriteThriftField,
WriteThrift, WriteThriftField, validate_list_type,
};
use crate::{thrift_enum, thrift_struct, thrift_union_all_empty, write_thrift_field};

Expand Down Expand Up @@ -771,6 +771,8 @@ impl<'a, R: ThriftCompactInputProtocol<'a>> ReadThrift<'a, R> for EncodingMask {

// This reads a Thrift `list<Encoding>` and turns it into a bitmask
let list_ident = prot.read_list_begin()?;
// check for enum (encoded as I32)
validate_list_type(ElementType::I32, &list_ident)?;
for _ in 0..list_ident.size {
let val = Encoding::read_thrift(prot)?;
mask |= 1 << val as i32;
Expand Down
9 changes: 8 additions & 1 deletion parquet/src/file/metadata/thrift/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,7 @@ use crate::{
parquet_thrift::{
ElementType, FieldType, ReadThrift, ThriftCompactInputProtocol,
ThriftCompactOutputProtocol, ThriftSliceInputProtocol, WriteThrift, WriteThriftField,
read_thrift_vec,
read_thrift_vec, validate_list_type,
},
schema::types::{
ColumnDescriptor, SchemaDescriptor, TypePtr, num_nodes, parquet_schema_from_array,
Expand Down Expand Up @@ -387,6 +387,9 @@ fn read_encoding_stats_as_mask<'a>(
// read the vector of stats, setting mask bits for data pages
let mut mask = 0i32;
let list_ident = prot.read_list_begin()?;
// check for PageEncodingStats struct
validate_list_type(ElementType::Struct, &list_ident)?;

for _ in 0..list_ident.size {
let pes = PageEncodingStats::read_thrift(prot)?;
match pes.page_type {
Expand Down Expand Up @@ -652,6 +655,8 @@ fn read_row_group(
match field_ident.id {
1 => {
let list_ident = prot.read_list_begin()?;
// check for list of struct
validate_list_type(ElementType::Struct, &list_ident)?;
if schema_descr.num_columns() != list_ident.size as usize {
return Err(general_err!(
"Column count mismatch. Schema has {} columns while Row Group has {}",
Expand Down Expand Up @@ -801,6 +806,8 @@ pub(crate) fn parquet_metadata_from_bytes(
}
let schema_descr = schema_descr.as_ref().unwrap();
let list_ident = prot.read_list_begin()?;
// check for list of struct
validate_list_type(ElementType::Struct, &list_ident)?;
let mut rg_vec = Vec::with_capacity(list_ident.size as usize);

// Read row groups and handle ordinal assignment
Expand Down
3 changes: 2 additions & 1 deletion parquet/src/file/page_index/offset_index.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ use std::io::Write;

use crate::parquet_thrift::{
ElementType, FieldType, ReadThrift, ThriftCompactInputProtocol, ThriftCompactOutputProtocol,
WriteThrift, WriteThriftField, read_thrift_vec,
WriteThrift, WriteThriftField, read_thrift_vec, validate_list_type,
};
use crate::{
errors::{ParquetError, Result},
Expand Down Expand Up @@ -90,6 +90,7 @@ impl OffsetIndexMetaData {

// we have to do this manually because we want to use the fast PageLocation decoder
let list_ident = prot.read_list_begin()?;
validate_list_type(ElementType::Struct, &list_ident)?;
let mut page_locations = Vec::with_capacity(list_ident.size as usize);
for _ in 0..list_ident.size {
page_locations.push(read_page_location(prot)?);
Expand Down
2 changes: 1 addition & 1 deletion parquet/src/file/serialized_reader.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2117,7 +2117,7 @@ mod tests {
let ret = SerializedFileReader::new(Bytes::copy_from_slice(&data));
assert_eq!(
ret.err().unwrap().to_string(),
"Parquet error: Received empty union from remote ColumnOrder"
"Parquet error: Expected list element type of Struct but got List"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data for this test is

[255, 172, 1, 0, 50, 82, 65, 73, 1, 0, 0, 0, 169, 168, 168, 162, 87, 255, 16, 0, 0, 0, 80, 65, 82, 49]
          |                                                               |            |
          start of footer                                                 length       PAR1                                                                                        

The x01 at the start decodes as a delta of 0 with a field type of BooleanTrue. Because delta is 0, a varint is read to obtain the field id, which consumes the 0 and returns a field id of 0, which is then skipped as unknown. The 50 (hex 0x32) encodes a delta of 3, with a field type of BooleanFalse. Because an i64 is expected, the 82 (hex x52) is consumed and returned as the value for num_rows (field 3). 65 (hex x41) is delta 4 -> field 7, with a type of BooleanTrue. Field 7 is a list of structs, so the 73 (hex x49) encodes 4 elements of type List. With the fix in this PR, the List is compared to the expected Struct type, and errors. Without the fix, encoding continues until a different error is detected.

Checking the expected field types would detect this even earlier (when the 3rd byte is consumed), but that is left for future work.

);
}

Expand Down
30 changes: 29 additions & 1 deletion parquet/src/parquet_thrift.rs
Original file line number Diff line number Diff line change
Expand Up @@ -702,9 +702,10 @@ impl<'a, R: ThriftCompactInputProtocol<'a>> ReadThrift<'a, R> for &'a [u8] {
pub(crate) fn read_thrift_vec<'a, T, R>(prot: &mut R) -> Result<Vec<T>>
where
R: ThriftCompactInputProtocol<'a>,
T: ReadThrift<'a, R>,
T: ReadThrift<'a, R> + WriteThrift,
{
let list_ident = prot.read_list_begin()?;
validate_list_type(T::ELEMENT_TYPE, &list_ident)?;
let mut res = Vec::with_capacity(list_ident.size as usize);
for _ in 0..list_ident.size {
let val = T::read_thrift(prot)?;
Expand All @@ -713,6 +714,17 @@ where
Ok(res)
}

pub(crate) fn validate_list_type(expected: ElementType, got: &ListIdentifier) -> Result<()> {
if got.element_type != expected {
return Err(general_err!(
"Expected list element type of {:?} but got {:?}",
expected,
got.element_type
));
}
Ok(())
}

/////////////////////////
// thrift compact output

Expand Down Expand Up @@ -1138,4 +1150,20 @@ pub(crate) mod tests {
let result = prot.read_list_begin();
assert!(result.is_err(), "expected error, got {result:?}");
}

#[test]
fn test_read_list_wrong_type() {
// list header: 4 elements of `Boolean`
let data = [0x42, 0x01];
let mut prot = ThriftSliceInputProtocol::new(&data);
// try to read as list<i32>
let result = read_thrift_vec::<i32, ThriftSliceInputProtocol>(&mut prot);
println!("{result:?}");
assert!(
result
.unwrap_err()
.to_string()
.contains("Expected list element type of I32 but got Bool")
);
}
}
2 changes: 1 addition & 1 deletion parquet/tests/arrow_reader/bad_data.rs
Original file line number Diff line number Diff line change
Expand Up @@ -98,7 +98,7 @@ fn test_arrow_gh_41317() {
let err = read_file("ARROW-GH-41317.parquet").unwrap_err();
assert_eq!(
err.to_string(),
"External: Parquet argument error: Parquet error: StructArrayReader out of sync in read_records, expected 5 read, got 2"
"Parquet error: Expected list element type of I32 but got I16"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When first added, this test expected "External: Parquet argument error: External: bad data", but was changed to the current form during the thrift remodel. Apparently there's more than one error in this file.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a reasonable error mesage to me. The data is still rejected (and in this case earlier)

Copy link
Copy Markdown
Contributor Author

@etseidl etseidl May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that the file is intended to test a completely different error. I wonder if it's worth going in with a hex editor to fix the bad list so the only error is the column mismatch. I think I'll file an issue on parquet-testing.

);
}

Expand Down
Loading