Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@
- feat: Improve DataType display for `RunEndEncoded` [\#8596](https://github.com/apache/arrow-rs/pull/8596) [[arrow](https://github.com/apache/arrow-rs/labels/arrow)] ([Weijun-H](https://github.com/Weijun-H))
- Add `ArrowError::AvroError`, remaining types and roundtrip tests to `arrow-avro`, [\#8595](https://github.com/apache/arrow-rs/pull/8595) [[arrow](https://github.com/apache/arrow-rs/labels/arrow)] ([jecsand838](https://github.com/jecsand838))
- \[thrift-remodel\] Refactor Thrift encryption and store encodings as bitmask [\#8587](https://github.com/apache/arrow-rs/pull/8587) [[parquet](https://github.com/apache/arrow-rs/labels/parquet)] ([etseidl](https://github.com/etseidl))
- Change default page encoding stats decoding mode from `Full` to `Mask` in `ParquetMetaDataOptions` [\#8859](https://github.com/apache/arrow-rs/pull/8859) [[parquet](https://github.com/apache/arrow-rs/labels/parquet)] ([user](https://github.com/user))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PRs shouldn't be changing the changelog for an already released version

@joe-ucp I suspect this PR is created by claude/codex or a similar tool. This is totally fine but we need your help to review the PR first before asking us to review it

It is not yet an arrow project guideline, but I personally think the guidelines from https://datafusion.apache.org/contributor-guide/index.html#ai-assisted-contributions helps undertand the background and what we generally expect from contrubutors

Thanks again

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Thanks for calling this out, and for the link to the AI-assisted contribution guidelines.

You’re right: I used an LLM/Copilot-style workflow to help draft this change. I did run cargo fmt, cargo clippy, and cargo test -p parquet, but I’m still pretty new to Git/GitHub flow and self-taught, so I missed that I shouldn’t touch the changelog for already released versions and that I was overlaping with the work in #8797.

For this PR I’m happy to let it be closed in favor of #8797.

I’ll make sure future contributions are better self-reviewed before I open a PR and that I follow the AI guidelines you linked. Thanks again for the feedback (and the patience)!

- feat: Enhance `Map` display formatting in DataType [\#8570](https://github.com/apache/arrow-rs/pull/8570) [[arrow](https://github.com/apache/arrow-rs/labels/arrow)] ([Weijun-H](https://github.com/Weijun-H))
- feat: Enhance DataType display formatting for `ListView` and `LargeListView` variants [\#8569](https://github.com/apache/arrow-rs/pull/8569) [[arrow](https://github.com/apache/arrow-rs/labels/arrow)] ([Weijun-H](https://github.com/Weijun-H))
- Use custom thrift parser for parquet metadata \(phase 1 of Thrift remodel\) [\#8530](https://github.com/apache/arrow-rs/pull/8530) [[parquet](https://github.com/apache/arrow-rs/labels/parquet)] ([etseidl](https://github.com/etseidl))
Expand Down
35 changes: 33 additions & 2 deletions parquet/benches/metadata.rs
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,7 @@ fn encoded_meta() -> Vec<u8> {
.set_data_page_offset(rng.random_range(4..2000000000))
.set_dictionary_page_offset(Some(rng.random_range(4..2000000000)))
.set_statistics(stats.clone())
.set_page_encoding_stats(vec![
.set_page_encoding_stats_full(vec![
PageEncodingStats {
page_type: PageType::DICTIONARY_PAGE,
encoding: Encoding::PLAIN,
Expand Down Expand Up @@ -165,14 +165,45 @@ fn criterion_benchmark(c: &mut Criterion) {
});

let schema = ParquetMetaDataReader::decode_schema(&meta_data).unwrap();
let options = ParquetMetaDataOptions::new().with_schema(schema);
let options = ParquetMetaDataOptions::new().with_schema(schema.clone());
c.bench_function("decode metadata with schema", |b| {
b.iter(|| {
ParquetMetaDataReader::decode_metadata_with_options(&meta_data, Some(&options))
.unwrap();
})
});

// Benchmark different page encoding stats modes
let options_mask = ParquetMetaDataOptions::new()
.with_schema(schema.clone())
.with_page_encoding_stats_mode(parquet::file::metadata::PageEncodingStatsMode::Mask);
c.bench_function("decode metadata (mask mode)", |b| {
b.iter(|| {
ParquetMetaDataReader::decode_metadata_with_options(&meta_data, Some(&options_mask))
.unwrap();
})
});

let options_full = ParquetMetaDataOptions::new()
.with_schema(schema.clone())
.with_page_encoding_stats_mode(parquet::file::metadata::PageEncodingStatsMode::Full);
c.bench_function("decode metadata (full mode)", |b| {
b.iter(|| {
ParquetMetaDataReader::decode_metadata_with_options(&meta_data, Some(&options_full))
.unwrap();
})
});

let options_skip = ParquetMetaDataOptions::new()
.with_schema(schema.clone())
.with_page_encoding_stats_mode(parquet::file::metadata::PageEncodingStatsMode::Skip);
c.bench_function("decode metadata (skip mode)", |b| {
b.iter(|| {
ParquetMetaDataReader::decode_metadata_with_options(&meta_data, Some(&options_skip))
.unwrap();
})
});

let buf: Bytes = black_box(encoded_meta()).into();
c.bench_function("decode parquet metadata (wide)", |b| {
b.iter(|| {
Expand Down
2 changes: 1 addition & 1 deletion parquet/src/arrow/array_reader/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,7 @@ mod test_util;

// Note that this crate is public under the `experimental` feature flag.
use crate::file::metadata::RowGroupMetaData;
pub use builder::{ArrayReaderBuilder, CacheOptions, CacheOptionsBuilder};
pub use builder::{ArrayReaderBuilder, CacheOptionsBuilder};
pub use byte_array::make_byte_array_reader;
pub use byte_array_dictionary::make_byte_array_dictionary_reader;
#[allow(unused_imports)] // Only used for benchmarks
Expand Down
7 changes: 5 additions & 2 deletions parquet/src/arrow/arrow_writer/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ use crate::data_type::{ByteArray, FixedLenByteArray};
#[cfg(feature = "encryption")]
use crate::encryption::encrypt::FileEncryptor;
use crate::errors::{ParquetError, Result};
use crate::file::metadata::{KeyValue, ParquetMetaData, RowGroupMetaData};
use crate::file::metadata::{KeyValue, PageEncodingStatsMode, ParquetMetaData, ParquetMetaDataOptions, RowGroupMetaData};
use crate::file::properties::{WriterProperties, WriterPropertiesPtr};
use crate::file::reader::{ChunkReader, Length};
use crate::file::writer::{SerializedFileWriter, SerializedRowGroupWriter};
Expand Down Expand Up @@ -4433,7 +4433,10 @@ mod tests {
.unwrap();

// check that the read metadata is also correct
let options = ReadOptionsBuilder::new().with_page_index().build();
let options = ReadOptionsBuilder::new()
.with_page_index()
.with_metadata_options(ParquetMetaDataOptions::new().with_page_encoding_stats_mode(PageEncodingStatsMode::Full))
.build();
let reader = SerializedFileReader::new_with_options(file, options).unwrap();

let rowgroup = reader.get_row_group(0).expect("row group missing");
Expand Down
3 changes: 1 addition & 2 deletions parquet/src/arrow/schema/extension.rs
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,6 @@ use crate::basic::LogicalType;
use crate::errors::ParquetError;
use crate::schema::types::Type;
use arrow_schema::Field;
use arrow_schema::extension::ExtensionType;

/// Adds extension type metadata, if necessary, based on the Parquet field's
/// [`LogicalType`]
Expand All @@ -36,7 +35,7 @@ use arrow_schema::extension::ExtensionType;
/// Arrow DataType, and instead are represented by an Arrow ExtensionType.
/// Extension types are attached to Arrow Fields via metadata.
pub(crate) fn try_add_extension_type(
mut arrow_field: Field,
arrow_field: Field,
parquet_type: &Type,
) -> Result<Field, ParquetError> {
let Some(parquet_logical_type) = parquet_type.get_basic_info().logical_type_ref() else {
Expand Down
42 changes: 24 additions & 18 deletions parquet/src/column/writer/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ use crate::encryption::encrypt::get_column_crypto_metadata;
use crate::errors::{ParquetError, Result};
use crate::file::metadata::{
ColumnChunkMetaData, ColumnChunkMetaDataBuilder, ColumnIndexBuilder, LevelHistogram,
OffsetIndexBuilder, PageEncodingStats,
OffsetIndexBuilder, PageEncodingStats, ParquetPageEncodingStats,
};
use crate::file::properties::{
EnabledStatistics, WriterProperties, WriterPropertiesPtr, WriterVersion,
Expand Down Expand Up @@ -1191,7 +1191,7 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
let mut builder = ColumnChunkMetaData::builder(self.descr.clone())
.set_compression(self.codec)
.set_encodings_mask(EncodingMask::new_from_encodings(self.encodings.iter()))
.set_page_encoding_stats(self.encoding_stats.clone())
.set_page_encoding_stats_full(self.encoding_stats.clone())
.set_total_compressed_size(total_compressed_size)
.set_total_uncompressed_size(total_uncompressed_size)
.set_num_values(num_values)
Expand Down Expand Up @@ -2485,21 +2485,24 @@ mod tests {
(PageType::DATA_PAGE, 1, 3),
]
);
assert_eq!(
r.metadata.page_encoding_stats(),
Some(&vec![
PageEncodingStats {
page_type: PageType::DICTIONARY_PAGE,
encoding: Encoding::PLAIN,
count: 1
},
PageEncodingStats {
page_type: PageType::DATA_PAGE,
encoding: Encoding::RLE_DICTIONARY,
count: 2,
}
])
);
match r.metadata.page_encoding_stats() {
Some(ParquetPageEncodingStats::Full(stats)) => assert_eq!(
stats,
&vec![
PageEncodingStats {
page_type: PageType::DICTIONARY_PAGE,
encoding: Encoding::PLAIN,
count: 1
},
PageEncodingStats {
page_type: PageType::DATA_PAGE,
encoding: Encoding::RLE_DICTIONARY,
count: 2,
}
]
),
_ => panic!("Expected full page encoding stats"),
}
}

#[test]
Expand Down Expand Up @@ -4104,7 +4107,10 @@ mod tests {
let meta = column_write_and_get_metadata::<T>(props, data);
assert_eq!(meta.dictionary_page_offset(), dictionary_page_offset);
assert_eq!(meta.encodings().collect::<Vec<_>>(), encodings);
assert_eq!(meta.page_encoding_stats().unwrap(), page_encoding_stats);
match meta.page_encoding_stats().unwrap() {
ParquetPageEncodingStats::Full(stats) => assert_eq!(stats, page_encoding_stats),
_ => panic!("Expected full page encoding stats"),
}
}

/// Returns column writer.
Expand Down
12 changes: 11 additions & 1 deletion parquet/src/file/metadata/memory.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,8 @@
use crate::basic::{BoundaryOrder, ColumnOrder, Compression, Encoding, PageType};
use crate::data_type::private::ParquetValueType;
use crate::file::metadata::{
ColumnChunkMetaData, FileMetaData, KeyValue, PageEncodingStats, RowGroupMetaData, SortingColumn,
ColumnChunkMetaData, FileMetaData, KeyValue, PageEncodingStats, ParquetPageEncodingStats,
RowGroupMetaData, SortingColumn,
};
use crate::file::page_index::column_index::{
ByteArrayColumnIndex, ColumnIndex, ColumnIndexMetaData, PrimitiveColumnIndex,
Expand Down Expand Up @@ -332,3 +333,12 @@ impl HeapSize for ColumnOrder {
0 // no heap allocations in ColumnOrder
}
}

impl HeapSize for ParquetPageEncodingStats {
fn heap_size(&self) -> usize {
match self {
ParquetPageEncodingStats::Mask(_) => 0, // EncodingMask has no heap allocations
ParquetPageEncodingStats::Full(vec) => vec.heap_size(),
}
}
}
Loading