
Reading from parquet files with more than one row group triggers expensive statistics collection #363

Open
phofl opened this issue Oct 26, 2023 · 27 comments


@phofl (Collaborator) commented Oct 26, 2023

cc @rjzamora

We should turn this off by default and use the logic of one file = one partition. Could you take a look? I am not very familiar with the read_parquet stuff.

import dask.dataframe as dd
from datetime import datetime


def read_data(filename):
    # The dataset is a directory of parquet files on S3
    path = "s3://coiled-runtime-ci/tpc-h/scale-100/" + filename + "/"
    return dd.read_parquet(path, engine="pyarrow", filesystem=None)


if __name__ == "__main__":
    from distributed import Client

    client = Client()

    var1 = datetime.strptime("1995-01-01", "%Y-%m-%d")
    var2 = datetime.strptime("1997-01-01", "%Y-%m-%d")

    line_item_ds = read_data("lineitem")

    lineitem_filtered = line_item_ds[
        (line_item_ds["l_shipdate"] >= var1) & (line_item_ds["l_shipdate"] < var2)
    ]
    lineitem_filtered["l_year"] = 1  # lineitem_filtered["l_shipdate"].dt.year
    lineitem_filtered["revenue"] = lineitem_filtered["l_extendedprice"] * (
        1.0 - lineitem_filtered["l_discount"]
    )

    # Calling optimize() alone is enough to trigger the statistics collection
    lineitem_filtered.optimize()

This is how we found that particular issue.

cc @fjetter

@mrocklin (Member)

We should turn this off by default and use the logic of one file = one partition

Sometimes this is important, but only when the file is big I think. (Some people have many-gigabyte files and use row groups heavily)

@phofl (Collaborator, Author) commented Oct 26, 2023

I agree, but they can still enable this if necessary. I don’t want to remove this functionality, just turn it off by default.

@mrocklin (Member)

Ideally I'd like for the right thing to happen automatically for these users without them having to know about the configuration setting (in general I think we're trying to make dask dataframe require less configuration). I suspect that if we were to look at a couple of files and see how large they are, we could make this decision for them with high certainty.

@fjetter (Member) commented Oct 26, 2023

I'm a strong -1 on having the current logic by default the way it is right now. I don't think it's sensible to trigger a full dataset scan the way we're doing it right now, particularly not during graph generation.

Part of my scepticism comes from the fact that this is done during graph generation and users may be surprised that this is happening; particularly if there is no large/remote cluster already connected when the dataset is opened, this can cause a significant delay for the read_parquet call.
Even more than that, I'm not entirely convinced the current logic is indeed what we want as a default. I agree that there are cases where this might make sense, but I'm not convinced that this is true for the general user population.

@mrocklin (Member)

I don't think it's sensible to trigger a full dataset scan the way we're doing it right now, particularly not during graph generation.

I agree that this is surprising. Some graph-generation-time check may be necessary, at least if we want to know things like how many partitions are in a dataset.

It may also make you feel better (or worse) knowing that Spark does a similar thing. There's a bit of a delay after every spark.read.parquet call as they go and sniff metadata.

My expectation is that we still do this in general, but are able to avoid it in the common case of files smaller than, say, 256 MiB.
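
To illustrate the kind of probe being discussed here (this is not dask's current behavior, just a hedged sketch; fsspec, the 256 MiB threshold, and the helper name are assumptions):

import fsspec

def looks_like_small_files(path, threshold=256 * 2**20, nprobe=2):
    # Hypothetical helper: check the on-disk size of the first few files
    # and only fall back to footer scanning if any of them looks large.
    fs, _, paths = fsspec.get_fs_token_paths(path)
    files = [f for f in fs.find(paths[0]) if f.endswith(".parquet")] or paths
    return all(fs.size(f) < threshold for f in files[:nprobe])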

@phofl (Collaborator, Author) commented Oct 26, 2023

This actually happens before graph generation; triggering optimize() is enough. That's not a great UX (@hendrikmakait and I were debugging locally this morning and the wait was quite long).

I think we run into the danger of being too smart, which will make things significantly worse. We should avoid these pre-emptive computations as much as possible, since they are a major pain in the current dask dataframe implementation. Doing this with a single file sounds fine, but a directory could be arbitrarily huge.

@fjetter (Member) commented Oct 26, 2023

Some graph-generation-time check may be necessary, at least if we want to know things like how many partitions are in a dataset.

Reading every parquet footer is different than asking the storage backend how many files there are.

It may also make you feel better (or worse) knowing that Spark does a similar thing

Similar, maybe. However, the current logic is triggering a full dataset scan. If we want to probe a couple of files, that's one thing, but touching every file once is way too expensive.

@mrocklin (Member)

I agree that probing a couple of files first is probably the right way to go, and then if we find that they're large, we'll probably need to read every footer.

Reading every parquet footer is different than asking the storage backend how many files there are.

Yeah, to be clear, I'm talking about the situation where people have large parquet files. This does occur (for example, we have customers for whom it occurs). However, it's also rare, and where I think we have strong agreement is that we should try to avoid this cost when possible, which we think is the common case.

@mrocklin (Member)

I think that we're all saying the same thing here, which is that we should probe a couple of files for their size and, if they're modest, just use one partition per file.

I also think (but I'm not sure that we're agreed here) that if the files are large, then we should scan all footers whenever we need to determine divisions (which is hopefully very late, but maybe isn't). We should do this in a compute call in case we have a cluster lying around to do this work for us.

@rjzamora (Member)

The current behavior is absolutely better than a file-to-partition default. We just look at the first file and check if it is larger than blocksize. If it isn’t, we DO use file-to-partition mapping. So, just change the blocksize or suggest a better way to control the blocksize default.

The adaptive behavior we have now has been a much better UX than we ever had before.

@fjetter (Member) commented Oct 26, 2023

The current behavior is absolutely better than a file-to-partition default.
The adaptive behavior we have now has been a much better UX than we ever had before.

Can you elaborate? Given the investigation we've been doing over the past few days, I'm very hesitant to make any general claims like this.
Generally, I'd like to learn more about what factors led to this decision.

@mrocklin (Member)

What I'm hearing from @rjzamora is that he's already doing something similar to what is being suggested: looking at a couple of files in order to make a quick determination.

If so, then probably the limit here is set far too low. We're triggering this situation with files that are 128 MB, which I think is a very normal size.

@fjetter (Member) commented Oct 26, 2023

already doing something similar to what is being suggested

I'm not suggesting much here other than turning this off by default and revisiting the logic. I'm lacking the context of why this was necessary or how this improved UX, but from what I can see right now this is not an improvement and it is potentially even harmful.

I'm actually not even convinced that we want to read all the footers even if the file(s) end(s) up being large.

@rjzamora (Member)

Yes, we look at the metadata of the first file, and check if the uncompressed byte size is below the limit. In your case, it must be larger than 128MiB.

Since adding this behavior, I’ve gone from fielding tens of questions/complaints about bad read_parquet partitioning to practically 0.

In dask-expr, you have the extra consideration that you may need to parse the metadata multiple times. Therefore, we could consider different defaults or checks (e.g. we could check the literal file size to avoid any metadata parsing). However, I will be -1 on a file-to-partition default (just from experience).
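
For context, the first-file check described above boils down to something like the following (a rough sketch with pyarrow, not the actual dask code path; the filename and the 128 MiB constant are placeholders):

import pyarrow.parquet as pq

BLOCKSIZE = 128 * 2**20  # assumed default of 128 MiB

# Read only the footer of the first file and sum the uncompressed
# row-group sizes recorded in its metadata.
md = pq.ParquetFile("part.0.parquet").metadata
uncompressed = sum(md.row_group(i).total_byte_size for i in range(md.num_row_groups))

# Small first file: map one file to one partition.
# Large first file: split partitions along row-group boundaries.
one_file_per_partition = uncompressed <= BLOCKSIZE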

@mrocklin (Member)

@fjetter the use case that matters here is when the user has a single very large parquet file, like a 100 GB file. Sometimes people do this rather than create many small files.

@mrocklin (Member)

Since adding this behavior, I’ve gone from fielding tens of questions/complaints about bad read_parquet partitioning to practically 0.

@rjzamora are these somewhere we can see?

@mrocklin (Member)

Yes, we look at the metadata of the first file, and check if the uncompressed byte size is below the limit. In your case, it must be larger than 128MiB.

Where is this set? I looked around for this for a while in the last couple of days. I thought for a bit that it was the blocksize_default in arrow.py, but that didn't seem to have an effect.

@fjetter (Member) commented Oct 26, 2023

@fjetter the use case that matters here is when the user has a single very large parquet file, like a 100 GB file. Sometimes people do this rather than create many small files.

Yes, I do understand this.

@mrocklin (Member)

Yes, I do understand this

Cool. Just checking. If a user presents with a 100 GB file I think that we should spend the effort to read the footer, rather than assume a single partition. Do you agree?

My sense is that the main thing to tweak here is the size at which we start defaulting to many-partitions-per-file. 128 MiB files should, I think, be single partitions, and 100 GB files should be treated with many partitions. I think that it'll get interesting trying to figure out the boundary. Do you agree with this general framing of the problem?

@rjzamora (Member)

I’ve been using blocksize=“256MiB” for RAPIDS benchmarking (with single parquet files at sf100). I think there are many users out there with large parquet files. It may be somewhat of an anti-pattern, but it is not at all uncommon.
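
As a usage note, the knobs discussed here can be passed straight to read_parquet (standard dask.dataframe keywords; the path is a placeholder, and exact semantics may vary between versions):

import dask.dataframe as dd

# Raise the threshold at which a file gets split into multiple partitions ...
df = dd.read_parquet("s3://bucket/dataset/", blocksize="256MiB")

# ... or opt out of row-group splitting entirely: one partition per file.
df = dd.read_parquet("s3://bucket/dataset/", split_row_groups=False)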

@mrocklin (Member)

I’ve been using blocksize=“256MiB” for RAPIDS benchmarking

Where is this default set?

@fjetter (Member) commented Oct 26, 2023

Cool. Just checking. If a user presents with a 100 GB file I think that we should spend the effort to read the footer, rather than assume a single partition. Do you agree?

I don't object to reading the footer and I don't object to splitting the file up into row groups.

I am merely suggesting that the current logic is harmful for most users and only beneficial for few. I'm not trying to make a case for a better logic here. This is an entirely separate topic.

I think that it'll get interesting trying to figure out the boundary. Do you agree with this general framing of the problem?

As I said, I don't even think we should scan the entire dataset, so I do not agree with this framing of the problem.

I'm trying really hard here not to start a conversation and design process about how such a mechanism should work. I have thoughts about this, but it is not necessarily a priority right now.
I'm merely pointing out that the current mechanism is not a good default and that we should turn it off and then have a conversation about what a good default should be.

@rjzamora (Member)

@mrocklin (Member)

Hrm, I believe I've tried changing that value and didn't see the behavior change. I've tried tracking that value back to where it's used, but I wasn't able to do it easily.

I'll try changing the value again and see if things work as I expect. If not, I might ask for your help in understanding this code.

@mrocklin (Member)

OK, I had changed this value from 128 to 256 but it still triggered on a file of size 131572280. When I moved it up to 512 MB it stopped triggering.

Maybe it is looking at the size of a full partition in memory or something?

Also, just a quick note, but this is the path I took in order to find out how/where this was used:

  • default_blocksize defined on the Engine class in dask/dataframe/io/parquet/utils.py
  • collected into blocksize in core.py (maybe we should just make this a default keyword argument?)
  • passed into engine.read_metadata
  • passed into engine.cls._collect_dataset_info
  • There's this crazy table there with some decision tree
  • passed into _infer_split_row_groups
  • found this line: aggregate_files or np.sum(row_group_sizes) > blocksize

FWIW, I found this developer experience pretty frustrating. Maybe parquet is inherently really complex, but I suspect that it's not this complex. This feels like substantial technical debt.

@rjzamora (Member)

Maybe it is looking at the size of a full partition in memory or something?

Yes, sort of. It is using the uncompressed storage size metadata in the footer of the first file.

Also, just a quick note, but this is the path I took in order to find out how/where this was used:
...
FWIW, I found this developer experience pretty frustrating. Maybe parquet is inherently really complex, but I suspect that it's not this complex. This feels like substantial technical debt.

You are correct. It really doesn't need to be this complex. Most of the ugliness is just meant (1) to avoid breaking changes for external Engine users and for people still using split_row_groups etc., and (2) to make the blocksize default engine-dependent. If I had the time to rewrite things from scratch with a fresh API, things would certainly look different.
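
For anyone puzzled above by a ~131 MB file tripping a 256 MiB limit: the footer records uncompressed sizes, which can be several times larger than the file on disk. A quick way to compare the two (pyarrow; the path is a placeholder):

import os
import pyarrow.parquet as pq

path = "part.0.parquet"  # placeholder
md = pq.ParquetFile(path).metadata
uncompressed = sum(md.row_group(i).total_byte_size for i in range(md.num_row_groups))

# The compressed on-disk size and the uncompressed size recorded in the
# footer routinely differ by a large factor.
print(f"on disk:      {os.path.getsize(path):>15,} bytes")
print(f"uncompressed: {uncompressed:>15,} bytes")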

@mrocklin (Member) commented Oct 26, 2023 via email
