-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GroupBy.shuffle_to_chunks()
#9320
Conversation
58d01b2
to
1df705e
Compare
1df705e
to
60d7619
Compare
* main: Revise (pydata#9366) Fix rechunking to a frequency with empty bins. (pydata#9364) whats-new entry for dropping python 3.9 (pydata#9359) drop support for `python=3.9` (pydata#8937) Revise (pydata#9357) try to fix scheduled hypothesis test (pydata#9358)
* main: fix html repr indexes section (pydata#9768) Bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.2 in the actions group (pydata#9763) unpin array-api-strict, as issues are resolved upstream (pydata#9762) rewrite the `min_deps_check` script (pydata#9754) CI runs ruff instead of pep8speaks (pydata#9759) Specify copyright holders in main license file (pydata#9756) Compress PNG files (pydata#9747) Dispatch to Dask if nanquantile is available (pydata#9719) Updates to Dask page in Xarray docs (pydata#9495) http:// → https:// (pydata#9748) Discard useless `!s` conversion in f-string (pydata#9752) Apply ruff/flake8-simplify rule SIM401 (pydata#9749) Use micromamba 1.5.10 where conda is needed (pydata#9737) pin array-api-strict<=2.1 (pydata#9751) Reorganise ruff rules (pydata#9738) use new conda-forge package pydap-server (pydata#9741)
We could also just have Then # Save the groupby kwargs since we need it twice
groupers = {"label": UniqueGrouper()}
# get the shuffled dataset
shuffled: Dataset = ds.groupby(groupers).shuffle()
# Now group the shuffled dataset the same way and
# map a UDF that expects all members of a group in memory at the same time.
shuffled.groupby(groupers).map(UDF) Alternative names could be After thinking about it a while, This is all fairly advanced stuff, so we can start with the smallest possible API surface, which would be cc @shoyer |
Very much agree with all of your message —
(If it were |
Would it be too verbose to call this |
For the method is on the Otherwise we could workshop |
* main: Bump minimum versions (pydata#9796) Namespace-aware `xarray.ufuncs` (pydata#9776) Add prettier and pygrep hooks to pre-commit hooks (pydata#9644) `rolling.construct`: Add `sliding_window_kwargs` to pipe arguments down to `sliding_window_view` (pydata#9720) Bump codecov/codecov-action from 4.6.0 to 5.0.2 in the actions group (pydata#9793) Buffer types (pydata#9787) Add download stats badges (pydata#9786) Fix open_mfdataset for list of fsspec files (pydata#9785) add 'User-Agent'-header to pooch.retrieve (pydata#9782) Optimize `ffill`, `bfill` with dask when `limit` is specified (pydata#9771) fix cf decoding of grid_mapping (pydata#9765) Allow wrapping `np.ndarray` subclasses (pydata#9760) Optimize polyfit (pydata#9766) Use `map_overlap` for rolling reductions with Dask (pydata#9770)
GroupBy.distributed_shuffle()
OK I went with gb = array.groupby_bins("dim_0", bins=bins, **cut_kwargs)
shuffled: DataArray = gb.distributed_shuffle()
gb = shuffle.groupby_bins("dim_0", bins=bins, **cut_kwargs) I couldn't think of a shorter name, but happy to change it. |
74d7bb0
to
c77d7c5
Compare
6951e6a
to
bccacfe
Compare
how about |
Looks good! (No strong view on the exact name...) |
GroupBy.distributed_shuffle()
GroupBy.shuffle_to_chunks()
OK I'm going with shuffle_to_chunks. Will merge in ~24 hours. It's been around a while now :) |
Sounds good!
…On Wed, Nov 20, 2024 at 3:58 PM Deepak Cherian ***@***.***> wrote:
OK I'm going with shuffle_to_chunks. Will merge in ~12 hours. It's been
around a while now :)
—
Reply to this email directly, view it on GitHub
<#9320 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJJFVXAWGTXX62ULVGF6L32BUOZVAVCNFSM6AAAAABMDO3MWGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBZG44DEMBZGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Really like the new name! |
Thanks for the input! |
This adds some new API to shuffle an Xarray object. Shuffling means we sort so that members of a group occur in the same chunk, with the possibility of multiple groups in a single chunk.
I've also added
shuffle_by
to DataArray and Dataset. This generalizessortby
, and lets you persist a shuffled Xarray object to disk.whats-new.rst
api.rst
2024.08.1
chunks
to signature.group
is chunkedcc @phofl