cc @rjzamora
We should turn this off by default and use the simple rule of one file = one partition. Could you take a look? I am not very familiar with the read_parquet internals.
from datetime import datetime

import dask.dataframe as dd  # assumed import; the original snippet does not show where `dd` comes from


def read_data(filename):
    path = "s3://coiled-runtime-ci/tpc-h/scale-100/" + filename + "/"
    return dd.read_parquet(path, engine="pyarrow", filesystem=None)


if __name__ == "__main__":
    from distributed import Client

    client = Client()
    var1 = datetime.strptime("1995-01-01", "%Y-%m-%d")
    var2 = datetime.strptime("1997-01-01", "%Y-%m-%d")
    line_item_ds = read_data("lineitem")
    lineitem_filtered = line_item_ds[
        (line_item_ds["l_shipdate"] >= var1) & (line_item_ds["l_shipdate"] < var2)
    ]
    lineitem_filtered["l_year"] = 1  # lineitem_filtered["l_shipdate"].dt.year
    lineitem_filtered["revenue"] = lineitem_filtered["l_extendedprice"] * (
        1.0 - lineitem_filtered["l_discount"]
    )
    lineitem_filtered.optimize()
This is how we found that particular issue
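For reference, the "one file = one partition" rule proposed above can be sketched in plain Python. This is only an illustration of the mapping, not how Dask plans partitions internally; `plan_partitions` and the file names are hypothetical.

```python
from typing import Iterable, List


def plan_partitions(files: Iterable[str]) -> List[List[str]]:
    """Hypothetical helper: map every file to exactly one partition.

    No files are merged into a shared partition and no file is split.
    """
    return [[path] for path in sorted(files)]


# Hypothetical file names, for illustration only.
files = ["part.2.parquet", "part.0.parquet", "part.1.parquet"]
plan = plan_partitions(files)

# Under this rule the number of partitions always equals the number of files.
assert len(plan) == len(files)
```

With this rule the partition count is predictable from the directory listing alone, which makes it easy to compare against what `read_parquet` actually produces.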
cc @fjetter