Currently, the parquet reader uses `split_row_groups='infer'` to decide whether a file should be split into multiple Dask tasks, which requires collecting parquet statistics from every file.
Running the code below triggers this statistics collection twice: once during the definition of the computation (while I'm mutating the dataframe) and again as soon as I call `compute`.
from datetime import datetime

import dask_expr as dd

VAR1 = datetime(1998, 9, 2)
lineitem_ds = dd.read_parquet("s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem")
lineitem_filtered = lineitem_ds[lineitem_ds.l_shipdate <= VAR1]
lineitem_filtered["sum_qty"] = lineitem_filtered.l_quantity
lineitem_filtered["sum_base_price"] = lineitem_filtered.l_extendedprice
lineitem_filtered["avg_qty"] = lineitem_filtered.l_quantity
lineitem_filtered["avg_price"] = lineitem_filtered.l_extendedprice
# This line now triggers a statistics collection iff `split_row_groups` is not False
lineitem_filtered["sum_disc_price"] = lineitem_filtered.l_extendedprice * (
1 - lineitem_filtered.l_discount
)
lineitem_filtered["sum_charge"] = (
lineitem_filtered.l_extendedprice
* (1 - lineitem_filtered.l_discount)
* (1 + lineitem_filtered.l_tax)
)
lineitem_filtered["avg_disc"] = lineitem_filtered.l_discount
lineitem_filtered["count_order"] = lineitem_filtered.l_discount
gb = lineitem_filtered.groupby(["l_returnflag", "l_linestatus"])
total = gb.agg(
{
"sum_qty": "sum",
"sum_base_price": "sum",
"sum_disc_price": "sum",
"sum_charge": "sum",
"avg_qty": "mean",
"avg_price": "mean",
"avg_disc": "mean",
"count_order": "count",
}
)
# Once I compute, another stats collection is triggered
total.compute()
related to #363
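One way to avoid paying the cost twice would be to cache the result of the statistics scan so the inference runs at most once per dataset. The sketch below illustrates that idea with `functools.lru_cache`; the function name `infer_split_row_groups` and the call counter are hypothetical stand-ins, not dask-expr's actual API.

```python
from functools import lru_cache

# Counter standing in for the expensive work: in the real reader,
# inferring split_row_groups touches the footer of every parquet file.
CALLS = {"count": 0}


@lru_cache(maxsize=None)
def infer_split_row_groups(path: str) -> bool:
    """Hypothetical helper: decide whether to split row groups for `path`."""
    CALLS["count"] += 1
    # Pretend we collected row-group statistics here and made a decision.
    return True


# First call (during graph definition) pays the cost ...
infer_split_row_groups("s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem")
# ... the second call (at compute time) is served from the cache.
infer_split_row_groups("s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem")
assert CALLS["count"] == 1
```

This only sketches the caching direction; the real fix would need to key the cache on whatever inputs (paths, filters, storage options) actually determine the inference result.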