Parquet statistics are collected twice #365

Open
fjetter opened this issue Oct 26, 2023 · 10 comments

fjetter (Member) commented Oct 26, 2023

Currently, the parquet reader uses split_row_groups='infer' to decide whether it should split a file into multiple dask tasks, which is done by collecting parquet statistics from every file.

Running the code below triggers this statistics collection twice: once during the definition of the computation (while I'm mutating the dataframe) and once as soon as I call compute.

from datetime import datetime

import dask_expr as dd

VAR1 = datetime(1998, 9, 2)

lineitem_ds = dd.read_parquet("s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem")

lineitem_filtered = lineitem_ds[lineitem_ds.l_shipdate <= VAR1]
lineitem_filtered["sum_qty"] = lineitem_filtered.l_quantity
lineitem_filtered["sum_base_price"] = lineitem_filtered.l_extendedprice
lineitem_filtered["avg_qty"] = lineitem_filtered.l_quantity
lineitem_filtered["avg_price"] = lineitem_filtered.l_extendedprice

# This line now triggers a statistics collection iff `split_row_groups` is not False
lineitem_filtered["sum_disc_price"] = lineitem_filtered.l_extendedprice * (
    1 - lineitem_filtered.l_discount
)


lineitem_filtered["sum_charge"] = (
    lineitem_filtered.l_extendedprice
    * (1 - lineitem_filtered.l_discount)
    * (1 + lineitem_filtered.l_tax)
)

lineitem_filtered["avg_disc"] = lineitem_filtered.l_discount
lineitem_filtered["count_order"] = lineitem_filtered.l_discount

gb = lineitem_filtered.groupby(["l_returnflag", "l_linestatus"])

total = gb.agg(
    {
        "sum_qty": "sum",
        "sum_base_price": "sum",
        "sum_disc_price": "sum",
        "sum_charge": "sum",
        "avg_qty": "mean",
        "avg_price": "mean",
        "avg_disc": "mean",
        "count_order": "count",
    }
)

# Once I compute, another stats collection is triggered
total.compute()
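
For completeness, a possible workaround rather than a fix (a sketch; it follows from the comment above that the collection only happens when split_row_groups is not False):

import dask_expr as dd

# Possible workaround (sketch, not a fix for the underlying double collection):
# with an explicit split_row_groups=False there is nothing to infer, so the
# statistics-based partitioning plan should not be needed at definition time.
lineitem_ds = dd.read_parquet(
    "s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem",
    split_row_groups=False,  # one output partition per file
)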


related to #363

fjetter (Member, Author) commented Oct 26, 2023

My naive expectation would be that the second collection is not necessary since we're doing some caching. Is this a spurious cache miss, or do we actually have to collect the statistics twice for some reason?

rjzamora (Member) commented

> which is done by collecting parquet statistics from every file.

We only want to check the first file. If we are collecting all statistics, then we are using pyarrow.dataset in the wrong way (or need to use something else to check the size of the first file).
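
For illustration, a minimal sketch of what "only check the first file" could look like with pyarrow.dataset (an assumption-laden sketch, not dask-expr's implementation; the 128 MiB threshold is made up):

import pyarrow.dataset as ds

# Sketch: parse the footer of the first file only and use its uncompressed
# row-group sizes to decide whether files should be split into row groups.
dataset = ds.dataset("s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem", format="parquet")
first_fragment = next(iter(dataset.get_fragments()))
file_meta = first_fragment.metadata  # footer of this one file, not the whole dataset
uncompressed_size = sum(
    file_meta.row_group(i).total_byte_size for i in range(file_meta.num_row_groups)
)
split_row_groups = uncompressed_size > 128 * 2**20  # illustrative threshold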

fjetter (Member, Author) commented Oct 26, 2023

> We only want to check the first file.

I'm surprised that this line requires us to check anything.

rjzamora (Member) commented

> I'm surprised that this line requires us to check anything.

Can you clarify what you mean by "this line"?

fjetter (Member, Author) commented Oct 26, 2023

lineitem_filtered["sum_disc_price"] = lineitem_filtered.l_extendedprice * (
    1 - lineitem_filtered.l_discount
)

This is what triggers the collection. I assume it steps into some "I have two frames that need to be aligned, so let's check whether we know the divisions" logic that is not necessary here.

fjetter (Member, Author) commented Oct 26, 2023

Patrick's example in #363 (comment) is essentially the same. He defines the dataframe code up to this point and then calls optimize, which triggers the collection.
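
For reference, a one-line sketch of that trigger (assuming the dask_expr collection-level optimize method):

# Sketch: optimizing without computing is already enough to trigger the
# second statistics collection in the example above.
total.optimize()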

rjzamora (Member) commented

Oh okay - Yeah, then I agree. We shouldn't need to do anything "new" at that step. In general, the dask-expr API will trigger whatever logic is necessary to calculate the current _meta, but shouldn't need to do anything with divisions/partitions if that information isn't needed to calculate that metadata.
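
For illustration, a minimal sketch of that idea on plain pandas objects (the dtypes are assumed; this is not dask-expr internals):

import pandas as pd

# Sketch: the meta of an elementwise expression follows from the operands'
# empty "meta" objects alone; divisions and partition counts never enter.
extendedprice_meta = pd.Series(dtype="float64", name="l_extendedprice")
discount_meta = pd.Series(dtype="float64", name="l_discount")
sum_disc_price_meta = extendedprice_meta * (1 - discount_meta)
assert len(sum_disc_price_meta) == 0 and sum_disc_price_meta.dtype == "float64"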

rjzamora (Member) commented Oct 26, 2023

Are you sure the first metadata collection isn't already happening in dd.read_parquet("s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem")? [EDIT: Never mind. That component shouldn't trigger any kind of graph computation.]

rjzamora (Member) commented

Okay, I spent some time looking into this and the problem is that an element-wise operation between collections will currently trigger an eager expr.are_co_aligned(self.expr, other) check.

This means that we end up asking the root IO expression for its divisions before any kind of optimization pass (column projection, combine similar, etc.). Later on, when we call compute (or just optimize), the optimization pass modifies the column projection, and so we end up needing to generate a fresh partitioning plan.

I'm pretty sure I can revise the ReadParquet caching logic to avoid the need to generate the partitioning plan twice.
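
One way to picture the caching idea (a hypothetical sketch, not the actual ReadParquet code): memoize the partitioning plan on a key that survives optimization, e.g. the dataset path plus the split_row_groups setting, so that a pushed-down column projection does not invalidate it.

from functools import lru_cache

# Hypothetical sketch, not dask-expr's ReadParquet logic: keying the memoized
# plan on stable inputs lets the eager are_co_aligned check (pre-optimization)
# and the graph construction (post-optimization) share one statistics pass.
@lru_cache(maxsize=None)
def partitioning_plan(path: str, split_row_groups) -> tuple:
    print(f"collecting statistics for {path}")  # would run only once per key
    return ()  # placeholder for the planned (file, row-group) splits

partitioning_plan("s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem", "infer")
partitioning_plan("s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem", "infer")  # cache hit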

phofl (Collaborator) commented Oct 26, 2023

#366

This is an improvement for this particular problem, though the general issue remains.
