
[TPC-H] nation and region partitioning is nonsensical #1380

Closed
hendrikmakait opened this issue Feb 13, 2024 · 8 comments

@hendrikmakait
Member

hendrikmakait commented Feb 13, 2024

The nation and region datasets contain a single row per partition (25 and 5 rows, respectively). While dask-expr and others should ideally be smart enough to detect this and act accordingly, this partitioning is nonsensical and not representative of a real workload.
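For illustration, here is a minimal sketch of how one might confirm that layout; the bucket path is hypothetical, assuming the benchmark tables are stored as partitioned Parquet and read with dask.dataframe:

```python
import dask.dataframe as dd

# Hypothetical location of the scale-1000 nation table.
nation = dd.read_parquet("s3://<benchmark-bucket>/tpch/sf1000/nation/")

print(nation.npartitions)                    # 25 partitions, one per file
print(nation.map_partitions(len).compute())  # a single row in each partition
```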

@hendrikmakait hendrikmakait changed the title [TPC-H] nation and region partitioning is non-sensical [TPC-H] nation and region partitioning is nonsensical Feb 13, 2024
@milesgranger
Contributor

milesgranger commented Feb 13, 2024

this partitioning is nonsensical and not representative of a real workload.

I don't know. I've seen lots of very imbalanced partitioning on raw data layers: some vendors dump small updates into a bucket at hourly increments, which then get combined with other vendors' larger weekly deposits. That's not so dissimilar to what we see here, IMO.

Do we have any idea how much these tables impact performance when, at scale 1000, there are 25 nation files while lineitem has 3000? Are the 25 and 5 small files really that degrading to performance? If so, it seems like that is an issue in itself, no?


Edit: I see the perceived performance impact in #1381, but I'd still maintain this isn't such a deviation from real-world workloads. Maybe in a transformed layer this would be weird, but certainly not for raw layers.

@hendrikmakait
Member Author

I don't know. I've seen lots of very imbalanced partitioning on raw data layers: some vendors dump small updates into a bucket at hourly increments, which then get combined with other vendors' larger weekly deposits. That's not so dissimilar to what we see here, IMO.

I see your point about updates, but these are dimension tables that change slowly, if ever. I see a point for imbalanced data on basically all other tables, but for these two, I consider this bad practice (and I would've had a very stern talk with that engineer back in my days working in business intelligence).

@milesgranger
Contributor

and I would've had a very stern talk with that engineer back in my days working in business intelligence

Sometimes people don't have the luxury of telling vendors what size their data arrives in or when they get it; it comes from outside the organization, and that's common. The job of the organization's data engineers is to transform these "bad", unoptimized data files/sources into something "good" for the organization to use, with engines like Dask.

@phofl
Contributor

phofl commented Feb 13, 2024

Let's open an issue saying that we should be able to detect those cases and then use a more sensible partitioning.

The idea of what we are doing at the moment is to identify potential problems; if a very weird partitioning hides other issues, then we don't gain anything and shoot ourselves in the foot.

@hendrikmakait
Member Author

As an aside: The nation and region datasets are not partitioned if you generate the data locally.

I had a quick offline discussion with @milesgranger and we agree that:

  • Generally, Dask should be able to be better at dealing with imbalanced datasets and small partitions stemming, e.g., from incremental updates.
  • We should not touch the imbalance on the other datasets.
  • For this particular instance, the partitioning makes little sense.

Where we disagree is on the question of whether to merge:
I'm advocating for it since it is unrealistic for this benchmark and does not match the data generated locally.
@milesgranger is advocating against it because it highlights things that Dask is bad at and will come up in other workloads (like incremental updates).

I'd still suggest merging the partitions, additionally adding a ticket as suggested by @phofl, and re-testing this once we have become better at estimating sizes/cost in dask-expr.

@phofl
Contributor

phofl commented Feb 13, 2024

I'd still suggest merging the partitions, additionally adding a ticket as suggested by @phofl, and re-testing this once we have become better at estimating sizes/cost in dask-expr.

FWIW, we could do this today in dask-expr, but we need information from read_parquet to choose a more sensible partitioning. It's very similar to what we do with projections. Touching the parquet implementation before we rewrite it does not seem smart, though.
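In the meantime, a user-level workaround (not the dask-expr-internal optimization described above) would be to collapse the tiny partitions right after reading; a rough sketch with a hypothetical path:

```python
import dask.dataframe as dd

nation = dd.read_parquet("s3://<benchmark-bucket>/tpch/sf1000/nation/")

# Collapse the 25 single-row partitions into one partition before joining.
nation = nation.repartition(npartitions=1)
```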

@hendrikmakait
Member Author

I've created an upstream issue: dask/dask-expr#869

@hendrikmakait
Member Author

For now, I've manually replaced the partitioned datasets with unpartitioned ones to move forward with benchmarking. I have created #1386 to track the work necessary to update the data generation script.
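For reference, a rough sketch of that manual consolidation (the bucket paths are hypothetical, not the actual benchmark locations):

```python
import dask.dataframe as dd

for table in ["nation", "region"]:
    df = dd.read_parquet(f"s3://<benchmark-bucket>/tpch/sf1000/{table}/")
    # Rewrite the many single-row files as a single unpartitioned Parquet dataset.
    df.repartition(npartitions=1).to_parquet(
        f"s3://<benchmark-bucket>/tpch/sf1000/{table}-unpartitioned/"
    )
```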
