Parquet statistics are collected twice #365
Comments
My naive expectation would be that the second collection is not necessary since we're doing some caching. Is this a false cache miss, or do we actually have to collect this twice for some reason?
We only want to check the first file. If we are collecting all statistics, then we are using […]
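The "check the first file" idea can be sketched roughly as follows; this is not dask's actual heuristic, and the path and threshold here are assumptions:

```python
import pyarrow.parquet as pq

# Sample only the first file's footer metadata instead of scanning
# every file in the dataset.
md = pq.ParquetFile("lineitem/part.0.parquet").metadata  # hypothetical path

# Decide whether row-group splitting is worthwhile by comparing
# row-group sizes against a blocksize-style threshold.
sizes = [md.row_group(i).total_byte_size for i in range(md.num_row_groups)]
should_split = max(sizes) > 128 * 2**20  # assumed 128 MiB threshold
```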
I'm surprised that this line requires us to check anything.
Can you clarify what you mean by "this line"?
```python
lineitem_filtered["sum_disc_price"] = lineitem_filtered.l_extendedprice * (
    1 - lineitem_filtered.l_discount
)
```

This is triggering the collection. I assume it steps into some "I have two frames that need to be aligned, let's better check if we know the divisions" logic that is not necessary here.
Patrick's example in #363 (comment) is essentially the same. He defines the dataframe code up to this point and then calls […]
Oh okay - Yeah, then I agree. We shouldn't need to do anything "new" at that step. In general, the dask-expr API will trigger whatever logic is necessary to calculate the current `divisions`.
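A rough illustration of that contract, assuming a hypothetical dataset path:

```python
import dask.dataframe as dd

df = dd.read_parquet("lineitem/")  # hypothetical path

# Nothing has been read yet; the expression is purely symbolic.
# Asking for divisions forces whatever metadata work is needed to
# answer the question -- with split_row_groups="infer", that can
# mean scanning parquet statistics.
print(df.divisions)
```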
Okay, I spent some time looking into this, and the problem is that an element-wise operation between collections will currently trigger an eager `expr.are_co_aligned(self.expr, other)` check. This means that we end up asking the root IO expression for its divisions before any kind of optimization pass (column projection, combine similar, etc.). Later on, when we call […]. I'm pretty sure I can revise the […]
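A schematic sketch of the failure mode described here (this is not dask-expr's actual implementation; only `are_co_aligned` is a real name, and the classes and statistics helper are illustrative):

```python
# Illustrative only: a simplified model of how an eager alignment check
# at expression-construction time can force early statistics collection.

class ReadParquet:
    """Stands in for the root IO expression."""

    @property
    def divisions(self):
        # With split_row_groups="infer", answering this requires scanning
        # parquet statistics -- expensive, and premature if no optimization
        # pass (column projection, combine similar, etc.) has run yet.
        return self._collect_statistics_and_compute_divisions()  # illustrative

def are_co_aligned(left, right):
    # Simplified: deciding whether two expressions share partitioning
    # ends up touching divisions on (the roots of) both sides.
    return left.divisions == right.divisions

class Elemwise:
    def __init__(self, left, right):
        # The eager check: it runs when the user writes `a * b`,
        # i.e. during graph *definition*, not at compute time.
        if not are_co_aligned(left, right):
            raise ValueError("cannot align collections")
```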
This is an improvement for this particular problem, but the general issue remains.
Currently, the parquet reader is using `split_row_groups='infer'` to infer whether it should split a file into multiple Dask tasks, which is done by collecting parquet statistics from every file. Running the code below triggers this statistics collection twice: once during the definition of the computation / while I'm mutating the dataframe, and once as soon as I call `compute`.
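A minimal sketch of a reproducer consistent with the discussion above; the dataset path and the filter predicate are assumptions, and the exact original code may have differed:

```python
import dask.dataframe as dd

# Hypothetical dataset path; column names are from the TPC-H lineitem
# table referenced in the thread.
lineitem = dd.read_parquet("lineitem/", split_row_groups="infer")

# First statistics collection happens while the dataframe is mutated...
lineitem_filtered = lineitem[lineitem.l_shipdate <= "1998-09-02"]  # assumed filter
lineitem_filtered["sum_disc_price"] = lineitem_filtered.l_extendedprice * (
    1 - lineitem_filtered.l_discount
)

# ...and a second one as soon as compute() is called.
result = lineitem_filtered.sum_disc_price.sum().compute()
```

In the meantime, passing `split_row_groups=False` (one output partition per file) should sidestep the statistics scan, at the cost of giving up the splitting heuristic.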
related to #363