-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate Dask Arrays properly #446
Comments
Mid term I would also like to have arrays be implemented with the expression system, if only to get rid of HLGs entirely |
I'm curious how much benefit a symbolic Dask array helps xarray. Probably less than an expression system for xarray itself, but how much less? Cc @dcherian who can maybe help is think through this. For example, probably the biggest benefit is more intelligent rechunking. Are chunks stored in the xarray data model or are they just given to the underlying duck arrays? |
A question I have struggled with. I don't think we really know without actually trying it out. To me, it seems like GroupBy is where an xarray level system makes a lot of sense.
Absolutely, specifically we'd want to be setting read-time chunk sizes that are adapted to the workload.
Just given to and taken from the underlying duck array. so a expression based dask array would be easy to slot in. |
For groupby stuff my recollection is that those optimizations are pretty local, right? You can maybe do that today with a little bit of logic in the current xarray code base? Maybe like how pandas does groupby? |
Mostly, my thought is that if we target the expression layer at Dask array it's maybe more likely to be done (y'all seem busy) but I'm nervous about not capturing enough of the value. If mostly all we care about is rechunking and if xarray can use systems like flox intelligently without a fancy expression system then we're good. If there are cases when xarray specific knowledge is valuable then it might not make as much sense to prioritize Dask array in expression form. |
For code complexity alone it makes sense to implement this. I don't think we have to be very sophisticated w.r.t optimizations but getting rid of HLGs would be a huge benefit for maintainability. |
Fine by me
…On Mon, Dec 4, 2023 at 12:50 AM Florian Jetter ***@***.***> wrote:
not make as much sense to prioritize Dask array in expression
For code complexity alone it makes sense to implement this. I don't think
we have to be very sophisticated w.r.t optimizations but getting rid of
HLGs would be a huge benefit for maintainability.
For HLG replacement I believe we don't need much more than a blockwise
expr and I suspect the Array class itself will be simpler since we don't
have to deal with meta and divisions.
I would really like to avoid ending up with three different systems, low
level, hlg and expressions. Arrays are still lagging behind in HLG adoption
which is already hurting us and I believe we should not make the same
mistake again.
I'd like to nuke HLGs in the next couple of months
—
Reply to this email directly, view it on GitHub
<#446 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTHPGWXEDAG3YRY2E2LYHWFHFAVCNFSM6AAAAABABTCOMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMZYGA4TAMBZGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
To me, It feels like most of the value is in automatically chunking for the user, and removing that knob from most workloads. And today Xarray is engineered so that chunks are user-specified and provided to dask.array. How easy would it be to prototype an expression system for Xarray? a few hours at AGU? |
That sounds possible-but-optimistic. I'd have more confidence in prototyping a numpy/dask.array system and dropping it into xarray. Given your first answer above I'm inclined to focus more on dask.array anyway rather than xarray. Especially if the following is true:
Do you need a full expression system to do groupbys well? If the answer is "no" and it's easy to do a little light rewriting then my sense is that we don't pursue an xarray expression system, and just do arrays. Although maybe there are other reasons to have high level expressions for xarray, like if Earthmover wanted to record what queries were common within a customerbase. |
I think the answer is "probably not". I made a proposal for these higher level objects that can return "preferred chunks". We would still need to propagate these chunking heuristics down to read-time, but that seems like it could be done at the array level. Sounds like you'd want to migrate dask.array to an expression system anyway, so let's prototype that and see where we get? |
Looking through the API, I'm guessing that our hierarchy is mostly based on the following:
There are some other outliers like the following:
I'll bet that an initial implementation focuses on from_array, blockwise, reductions, and slicing. I think that we can deliver a lot of functionality with those. Of course, blockwise and slicing are both pretty hairy and have, I suspect, bus factors of zero today. Someone (maybe me) probably has to go and do those first. Then it's probably easy for other people to come on afterwards. |
I wonder if @shoyer has thoughts on the advantages of building an expression system on xarray rather than dask.array. Such a thing would seem to resemble xarray-beam in some ways. It seems like the classic issues
are solvable at the dask.array level. |
Very interested to see this work! A couple of questions and thoughts (based on my work in Cubed):
|
It is not a goal of mine (I mostly just work on Dask and Coiled) but I don't think it would be hard to do. I recommend looking at the current dataframe implementation. Much of the code is about modeling the pandas API, then there are some methods/protocols like To be clear, I have no intention of spending energy to support other systems, but if other people want to come in and collaborate I'm sure we could make space for them.
We know the size of every chunk and the dtype, so presumably yes, on a per-chunk basis. |
We can't currently go from DataFrames to arrays, #445 adds
to_dask_array
but this is only a bandaid for now.I think in an ideal world we have an Array Collection that captures something like
.values
, then you can do array computations and go back to a DataFrame collection. Downside is that an Array collection without methods is not very helpfulI am not planning on working on this immediately, but wanted to collect thoughts on this topic
cc @mrocklin @rjzamora
The text was updated successfully, but these errors were encountered: