
[TPC-H] nation and region partitioning is nonsensical #1380

Closed
hendrikmakait opened this issue Feb 13, 2024 · 8 comments

@hendrikmakait
Member

hendrikmakait commented Feb 13, 2024

The nation and region datasets contain a single row per partition (25 and 5 rows, respectively). While dask-expr and others should ideally be smart enough to detect this and act accordingly, this partitioning is nonsensical and not representative of a real workload.
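For illustration, here is a minimal sketch of how one might confirm that layout; the bucket path is hypothetical, assuming the benchmark tables are stored as partitioned Parquet and read with dask.dataframe:

```python
import dask.dataframe as dd

# Hypothetical location of the scale-1000 nation table.
nation = dd.read_parquet("s3://<benchmark-bucket>/tpch/sf1000/nation/")

print(nation.npartitions)                    # 25 partitions, one per file
print(nation.map_partitions(len).compute())  # a single row in each partition
```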

@hendrikmakait hendrikmakait changed the title [TPC-H] nation and region partitioning is non-sensical [TPC-H] nation and region partitioning is nonsensical Feb 13, 2024
@milesgranger
Contributor

milesgranger commented Feb 13, 2024

this partitioning is nonsensical and not representative of a real workload.

I don't know. I've seen lots of very imbalanced partitioning on raw data layers: some vendors dump small updates into a bucket at hourly increments, which then get combined with other vendors' larger weekly deposits. That's not so dissimilar to what we see here, IMO.

Do we have any idea how much these tables impact performance when, at scale 1000, there are 25 nation files while lineitem has 3000? Are the 25 and 5 small files really that degrading to performance? If so, it seems like that is an issue in itself, no?


Edit: I see the perceived performance impact in #1381, but I'd still maintain this isn't such a deviation from real-world workloads. Maybe in a transformed layer this would be weird, but certainly not for raw layers.

@hendrikmakait
Member Author

I don't know. I've seen lots of very imbalanced partitioning on raw data layers: some vendors dump small updates into a bucket at hourly increments, which then get combined with other vendors' larger weekly deposits. That's not so dissimilar to what we see here, IMO.

I see your point about updates, but these are dimension tables that change slowly, if ever. I see a point for imbalanced data on basically all other tables, but for these two, I consider this bad practice (and I would've had a very stern talk with that engineer back in my days working in business intelligence).

@milesgranger
Contributor

and I would've had a very stern talk with that engineer back in my days working in business intelligence

Sometimes people don't have the luxury of telling vendors what size their data arrives in or when they get it; it comes from outside the organization, and that's common. The job of the organization's data engineers is to transform these "bad", unoptimized data files/sources into something "good" for the organization to use, with engines like Dask.

@phofl
Contributor

phofl commented Feb 13, 2024

Let's open an issue saying that we should be able to detect those cases and then use a more sensible partitioning.

The idea of what we are doing at the moment is to identify potential problems; if a very weird partitioning hides other issues, then we don't gain anything and shoot ourselves in the foot.

@hendrikmakait
Member Author

As an aside: The nation and region datasets are not partitioned if you generate the data locally.

I had a quick offline discussion with @milesgranger and we agree that:

  • Generally, Dask should be able to be better at dealing with imbalanced datasets and small partitions stemming, e.g., from incremental updates.
  • We should not touch the imbalance on the other datasets.
  • For this particular instance, the partitioning makes little sense.

Where we disagree is on the question of whether to merge:
I'm advocating for it since it is unrealistic for this benchmark and does not match the data generated locally.
@milesgranger is advocating against it because it highlights things that Dask is bad at and will come up in other workloads (like incremental updates).

I'd still suggest merging the partitions, additionally adding a ticket as suggested by @phofl, and re-testing this once we have become better at estimating sizes/cost in dask-expr.

@phofl
Contributor

phofl commented Feb 13, 2024

I'd still suggest merging the partitions, additionally adding a ticket as suggested by @phofl, and re-testing this once we have become better at estimating sizes/cost in dask-expr.

FWIW, we could do this today in dask-expr, but we need information from read_parquet to choose a more sensible partitioning. It's very similar to what we do with projections. Touching the parquet implementation before we rewrite it does not seem smart, though.
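In the meantime, a user-level workaround (not the dask-expr-internal optimization described above) would be to collapse the tiny partitions right after reading; a rough sketch with a hypothetical path:

```python
import dask.dataframe as dd

nation = dd.read_parquet("s3://<benchmark-bucket>/tpch/sf1000/nation/")

# Collapse the 25 single-row partitions into one partition before joining.
nation = nation.repartition(npartitions=1)
```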

@hendrikmakait
Member Author

I've created an upstream issue: dask/dask-expr#869

@hendrikmakait
Member Author

For now, I've manually replaced the partitioned datasets with unpartitioned ones to move forward with benchmarking. I have created #1386 to track the work necessary to update the data generation script.
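For reference, a rough sketch of that manual consolidation (the bucket paths are hypothetical, not the actual benchmark locations):

```python
import dask.dataframe as dd

for table in ["nation", "region"]:
    df = dd.read_parquet(f"s3://<benchmark-bucket>/tpch/sf1000/{table}/")
    # Rewrite the many single-row files as a single unpartitioned Parquet dataset.
    df.repartition(npartitions=1).to_parquet(
        f"s3://<benchmark-bucket>/tpch/sf1000/{table}-unpartitioned/"
    )
```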
