Eagerly computing results limits potential for optimization #826
Comments
With "eagerly computing" you are referring to stuff like head, tail, Len, ..., correct? Just to avoid confusion with the set_index space.
Yes, I loosely refer to everything that calls ...
IMHO it depends.
Why would head be considered a "quick and dirty" development tool? IMO, there are plenty of use cases where I just want to compute the top K of something. Honestly, I don't care too much about ...
I guess that's just my coding habits. Yes, if head() is used in production code it should definitely be lazy by default.
Problem
By eagerly computing, e.g., when calling .head(), we limit the potential for optimizations. For example, it shouldn't matter whether I call df.head(n)[["a", "b"]] or df[["a", "b"]].head(n). With eager computation, we lose the opportunity to push the column projection before the head selection.

Apart from that, in my opinion, eager computation comes as a surprise and makes it harder to reason about when Dask actually computes things.
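To illustrate the point about pushing the projection before the head selection, here is a minimal sketch of a toy expression tree. The Source, Head, and Project classes and the push_projection function are hypothetical names invented for this example; they are not Dask's API, and the rewrite covers only this single rule.

```python
from dataclasses import dataclass


# Hypothetical toy expression nodes; these are NOT Dask classes.
@dataclass(frozen=True)
class Source:
    columns: tuple  # column names available in the underlying data


@dataclass(frozen=True)
class Head:
    child: object
    n: int


@dataclass(frozen=True)
class Project:
    child: object
    columns: tuple


def push_projection(expr):
    """Rewrite Project(Head(x, n), cols) into Head(Project(x, cols), n)."""
    if isinstance(expr, Project) and isinstance(expr.child, Head):
        head = expr.child
        pushed = push_projection(Project(head.child, expr.columns))
        return Head(pushed, head.n)
    if isinstance(expr, Head):
        return Head(push_projection(expr.child), expr.n)
    if isinstance(expr, Project):
        return Project(push_projection(expr.child), expr.columns)
    return expr


source = Source(columns=("a", "b", "c", "d"))

# Lazy equivalent of df.head(n)[["a", "b"]]
plan_1 = Project(Head(source, 5), ("a", "b"))
# Lazy equivalent of df[["a", "b"]].head(n)
plan_2 = Head(Project(source, ("a", "b")), 5)

# As long as head stays lazy, both spellings can be normalized to the same plan.
optimized = push_projection(plan_1)
```

If head were evaluated eagerly, plan_1 would never exist as an expression: all four columns would already have been read by the time the projection is applied, so there would be nothing left to rewrite.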
Proposed solution(s)
In general, limit eager computation as much as possible. For this particular example: Deprecate compute=True as the default and switch to compute=False in the future.
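One possible way to stage such a default change is sketched below, assuming a sentinel default plus a FutureWarning. LazyFrame, _NO_DEFAULT, and the warning text are hypothetical and made up for this illustration; this is not how Dask implements head.

```python
import warnings

_NO_DEFAULT = object()  # sentinel: the caller did not pass `compute`


class LazyFrame:
    """Hypothetical stand-in for a lazy collection; not a Dask class."""

    def __init__(self, rows):
        self._rows = list(rows)

    def compute(self):
        # Materialize the result.
        return list(self._rows)

    def head(self, n=5, compute=_NO_DEFAULT):
        if compute is _NO_DEFAULT:
            warnings.warn(
                "The default compute=True is deprecated and will change to "
                "compute=False; pass compute explicitly.",
                FutureWarning,
                stacklevel=2,
            )
            compute = True  # keep the current behaviour during the deprecation period
        result = LazyFrame(self._rows[:n])
        return result.compute() if compute else result


lf = LazyFrame(range(10))
lazy_head = lf.head(3, compute=False)  # stays lazy, no warning
values = lazy_head.compute()           # materialized only on request
```

Callers who pass compute explicitly see no warning, so only code that relies on the implicit eager default is nudged to migrate.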