Skip to content

Parquet statistics are collected twice #365

Open
@fjetter

Description

@fjetter

Currently, the parquet reader is using split_row_groups='infer' to infer whether it should split a file into multiple dask tasks or not which is done by collecting parquet statistics from every file.

Running the code below triggers this statistics collection twice. Once during the definition of the computation / while I'm mutating the dataframe and once as soon as I call compute.

import dask_expr as dd

VAR1 = datetime(1998, 9, 2)

lineitem_ds = dd.read_parquet("s3://coiled-runtime-ci/tpc-h/scale-1000/lineitem")

lineitem_filtered = lineitem_ds[lineitem_ds.l_shipdate <= VAR1]
lineitem_filtered["sum_qty"] = lineitem_filtered.l_quantity
lineitem_filtered["sum_base_price"] = lineitem_filtered.l_extendedprice
lineitem_filtered["avg_qty"] = lineitem_filtered.l_quantity
lineitem_filtered["avg_price"] = lineitem_filtered.l_extendedprice

# This line now triggers a statistics collection iff `split_row_groups` is not False
lineitem_filtered["sum_disc_price"] = lineitem_filtered.l_extendedprice * (
    1 - lineitem_filtered.l_discount
)


lineitem_filtered["sum_charge"] = (
    lineitem_filtered.l_extendedprice
    * (1 - lineitem_filtered.l_discount)
    * (1 + lineitem_filtered.l_tax)
)

lineitem_filtered["avg_disc"] = lineitem_filtered.l_discount
lineitem_filtered["count_order"] = lineitem_filtered.l_discount

gb = lineitem_filtered.groupby(["l_returnflag", "l_linestatus"])

total = gb.agg(
    {
        "sum_qty": "sum",
        "sum_base_price": "sum",
        "sum_disc_price": "sum",
        "sum_charge": "sum",
        "avg_qty": "mean",
        "avg_price": "mean",
        "avg_disc": "mean",
        "count_order": "count",
    }
)

# Once I compute, another stats collection is triggered
total.compute()

image

related to #363

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions