
Reading from parquet files with more than one row group triggers expensive statistics collection #363


Description


cc @rjzamora

We should turn this off by default and use the one file = one partition logic instead. Could you take a look? I am not very familiar with the read_parquet code.
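For context, a quick way to confirm that the files in question actually contain multiple row groups (rough sketch; I'm assuming the CI bucket is publicly readable and that s3fs is installed):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # assumption: anonymous read access works
path = "coiled-runtime-ci/tpc-h/scale-100/lineitem/"

# More than one row group per file is what triggers the expensive
# per-row-group statistics collection described above.
for key in fs.ls(path)[:5]:
    with fs.open(key, "rb") as fh:
        print(key, pq.ParquetFile(fh).num_row_groups)

The reproducer: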

import dask.dataframe as dd
from datetime import datetime

def read_data(filename):
    path = "s3://coiled-runtime-ci/tpc-h/scale-100/" + filename + "/"
    return dd.read_parquet(path, engine="pyarrow", filesystem=None)

if __name__ == "__main__":
    from distributed import Client
    client = Client()

    var1 = datetime.strptime("1995-01-01", "%Y-%m-%d")
    var2 = datetime.strptime("1997-01-01", "%Y-%m-%d")

    line_item_ds = read_data("lineitem")

    lineitem_filtered = line_item_ds[
        (line_item_ds["l_shipdate"] >= var1) & (line_item_ds["l_shipdate"] < var2)
    ]
    lineitem_filtered["l_year"] = 1  # lineitem_filtered["l_shipdate"].dt.year
    lineitem_filtered["revenue"] = lineitem_filtered["l_extendedprice"] * (
        1.0 - lineitem_filtered["l_discount"]
    )

    lineitem_filtered.optimize()

This is how we found that particular issue.
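As a possible stopgap until the default changes, forcing one partition per file should skip the statistics pass entirely. Untested sketch; I'm assuming the split_row_groups keyword from dask.dataframe.read_parquet is still honored on this code path:

def read_data(filename):
    path = "s3://coiled-runtime-ci/tpc-h/scale-100/" + filename + "/"
    # split_row_groups=False maps each file to exactly one partition,
    # so no per-row-group statistics should be gathered.
    return dd.read_parquet(path, engine="pyarrow", split_row_groups=False)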

cc @fjetter
