Parquet StatisticsConverter does not work for struct columns #7364

Open
kylebarron opened this issue Mar 31, 2025 · 0 comments · May be fixed by #7365

kylebarron commented Mar 31, 2025

Describe the bug

The StatisticsConverter produces all-null columns for struct-type fields.

To Reproduce

#[cfg(test)]
mod test_geoparquet {
    use std::sync::Arc;

    use arrow::array::AsArray;
    use arrow::datatypes::Float32Type;
    use object_store::aws::AmazonS3Builder;
    use parquet::arrow::arrow_reader::ArrowReaderMetadata;
    use parquet::arrow::async_reader::ParquetObjectReader;

    use super::*;

    #[tokio::test]
    async fn test_struct_geoparquet() {
        let store = Arc::new(
            AmazonS3Builder::new()
                .with_bucket_name("overturemaps-us-west-2")
                .with_skip_signature(true)
                .with_region("us-west-2")
                .build()
                .unwrap(),
        );
        let path = "release/2025-02-19.0/theme=addresses/type=address/part-00010-e084a2d7-fea9-41e5-a56f-e638a3307547-c000.zstd.parquet";
        let mut object_reader = ParquetObjectReader::new(store, path.into());
        let meta = ArrowReaderMetadata::load_async(&mut object_reader, Default::default())
            .await
            .unwrap();

        let parquet_schema = meta.parquet_schema();
        let column_desc = parquet_schema.column(2);

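        // Read the raw min statistic for leaf column 2 directly from the
        // row-group metadata. Leaf column 2 is assumed to be the first child
        // of the `bbox` struct, matching `mins_struct.column(0)` below.
        let min_bytes = meta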
            .metadata()
            .row_group(0)
            .column(2)
            .statistics()
            .unwrap()
            .min_bytes_opt()
            .unwrap();

        let statistics_value_direct = f32::from_le_bytes(min_bytes.try_into().unwrap());

        let converter =
            StatisticsConverter::try_new("bbox", meta.schema(), meta.parquet_schema()).unwrap();
        let mins = converter
            .row_group_mins(meta.metadata().row_groups())
            .unwrap();
        let mins_struct = mins.as_struct();
        let minx_bbox_minx = mins_struct.column(0).as_primitive::<Float32Type>();

        // This assertion fails: the converter returns null for the struct's child.
        assert!(minx_bbox_minx.is_valid(0));
        let statistics_value_via_converter = minx_bbox_minx.value(0);

        assert_eq!(statistics_value_direct, statistics_value_via_converter);
    }
}
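
The direct read in the reproduction relies on Parquet storing the min/max statistics of a FLOAT column as 4 little-endian bytes. As a minimal, standard-library-only sketch of that decoding step (the helper name `decode_f32_stat` is hypothetical and not part of the parquet crate), it could look like this:

```rust
use std::convert::TryInto;

/// Hypothetical helper (not a parquet crate API): decode the raw bytes of a
/// FLOAT min/max statistic, which Parquet stores as 4 little-endian bytes.
fn decode_f32_stat(bytes: &[u8]) -> Option<f32> {
    // The conversion only succeeds when the slice is exactly 4 bytes long.
    let arr: [u8; 4] = bytes.try_into().ok()?;
    Some(f32::from_le_bytes(arr))
}

fn main() {
    // Stand-in for the slice returned by `min_bytes_opt()` in the reproduction.
    let raw = (-122.5f32).to_le_bytes();
    assert_eq!(decode_f32_stat(&raw), Some(-122.5));

    // A slice of the wrong length yields `None` instead of panicking.
    assert_eq!(decode_f32_stat(&raw[..3]), None);
}
```

Returning `Option` rather than unwrapping keeps a wrong-sized statistic from panicking, which may be preferable outside of a test.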

Expected behavior

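StatisticsConverter should provide some way to handle struct columns, for example by returning the statistics of each child field instead of nulls, as the test above expects.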

Additional context

This issue appears to have been documented in DataFusion before the StatisticsConverter was moved into parquet core.

There does not appear to be an existing issue for it in this repo.
