This repository was archived by the owner on Aug 19, 2024. It is now read-only.

Commit a32583e

authored
use get_client in dask example
1 parent 635f881 commit a32583e


1 file changed

+5
-5
lines changed


Partitioned.md

+5-5
@@ -1,7 +1,7 @@
11
**Draft**
22

33
# Motivation
4-
When operating in distributed memory systems a data container (such as tensors or data-frames) may be partitioned into several smaller chunks which might be allocated in distinct (distributed) address spaces. An implementation of operations defined specifically for such a partitioned data container will likely need to use specifics of the structure to provide good performance. Most prominently, making use of the data locality can be vital. For a consumer of such a partitioned container it will be inconvenient (if not impossible) to write specific code for any possible partitioned and distributed data container. On the other hand, like with the array interface, it would be much better to have a standard way of extracting the meta data about the partitioned and distributed nature of a given container.
4+
When operating in distributed memory systems, a data container (such as a tensor or data-frame) may be partitioned into several smaller chunks, which may be allocated in distinct (distributed) address spaces. An implementation of operations defined specifically for such a partitioned data container will likely need to use specifics of the structure to provide good performance. Most prominently, making use of data locality can be vital. For a consumer of such a partitioned container, it is inconvenient (if not impossible) to write specific code for every possible partitioned and distributed data container. As with the array/dataframe API interfaces, it would be much better to have a standard way of extracting the metadata about the partitioned and distributed nature of a given container.
55

66
The goal of the `__partitioned__` protocol is to allow partitioned and distributed data containers to expose information to consumers so that unnecessary copying can be avoided as much as possible.
77

@@ -51,11 +51,11 @@ In addition to the above required keys a container is encouraged to provide more
5151
* The actual data of the partition, potentially provided as a handle.
5252
* All data/partitions must be of the same type.
5353
* The actual partition type is opaque to the `__partitioned__` protocol.
54-
* The consumer needs to deal with it like it would in a non-partitioned case. For example, if the consumer can deal with array it could check for the existence of the array interface. Other conforming consumers could hard-code a very specific type check.
54+
* The consumer needs to deal with it as it would in a non-partitioned case. For example, a consumer that can handle arrays could check for the existence of the array/dataframe APIs (https://github.com/data-apis/array-api, https://github.com/data-apis/dataframe-api). Other conforming consumers could hard-code a very specific type check.
5555
* Whenever possible the data should be provided as references. References are mandatory for non-SPMD backends. This avoids unnecessary data movement.
5656
* Ray: ray.ObjectRef
5757
* Dask: dask.Future
58-
* It is recommended to access the actual data through the callable in the 'get' field of `__partitioned__`. This allows consumers to avoid differentiating between different handle types and type checks can be limited to basic types like pandas.DataFrame and numpy.ndarray.
58+
* It is recommended to access the actual data through the callable in the 'get' field of `__partitioned__`. This allows consumers to avoid checking handle types and container types.
5959

6060
* For SPMD-MPI-like backends: partitions which are not locally available may be `None`. This is the recommended behavior unless the underlying backend supports references (such as promises/futures) to avoid unnecessary data movement.
6161
* `location`
@@ -66,7 +66,7 @@ In addition to the above required keys a container is encouraged to provide more
6666
* SPMD/MPI-like frameworks such as MPI, SHMEM etc.: rank
6767

6868
## `get`
69-
A callable must return raw data object when called with a handle (or sequence of handles) provided in the `data` field of an entry in `partition`. Raw data objects are standard data structures like pandas.DataFrame and numpy.ndarray.
69+
The callable must return a raw data object when called with a handle (or sequence of handles) provided in the `data` field of an entry in `partitions`. Raw data objects are standard data structures: DataFrame or nd-array.
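
A minimal sketch of how a consumer might call `get`, under several assumptions: `concurrent.futures` futures stand in for real Ray or Dask handles, plain lists stand in for DataFrame/nd-array data, and the tuple partition keys and the helper name `resolve` are purely illustrative.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def resolve(handle_or_handles):
    """Hypothetical 'get' callable: accepts one handle or a sequence of handles."""
    if isinstance(handle_or_handles, Future):
        return handle_or_handles.result()
    return [h.result() for h in handle_or_handles]


# Produce two handles; leaving the 'with' block waits for both tasks to finish.
with ThreadPoolExecutor(max_workers=2) as pool:
    handles = [pool.submit(list, range(i * 4, (i + 1) * 4)) for i in range(2)]

# Only the fields discussed here are filled in; key format is an assumption.
partitioned = {
    "partitions": {
        (0,): {"data": handles[0]},
        (1,): {"data": handles[1]},
    },
    "get": resolve,
}

get = partitioned["get"]
# A single handle yields one raw data object.
first = get(partitioned["partitions"][(0,)]["data"])
# A sequence of handles yields a sequence of raw data objects.
both = get([p["data"] for p in partitioned["partitions"].values()])
```

Because the consumer only ever goes through `get`, it never needs to know which kind of handle the producer used.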
7070

7171
## `locals`
7272
The short-cut for SPMD environments allows processes/ranks to quickly extract their local partition. It saves processes from parsing the `partitions` dictionary for the local rank/address which is helpful when the number of ranks/processes/PEs is large.
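
To illustrate the shortcut, here is a sketch that assumes `locals` holds the keys of `partitions` owned by the current rank. The layout of `locals`, the tuple key format, and the `rank` value are assumptions made for this example, since this section does not spell them out.

```python
rank = 1  # hypothetical rank of the current process (e.g. obtained from MPI)

# Non-local partitions carry None as data, as recommended for SPMD backends.
partitions = {
    (0,): {"data": None, "location": [0]},
    (1,): {"data": [4, 5, 6, 7], "location": [1]},
    (2,): {"data": None, "location": [0]},
    (3,): {"data": [12, 13, 14, 15], "location": [1]},
}
local_keys = [(1,), (3,)]  # assumed content of 'locals' on rank 1

# Without the shortcut: scan every partition and compare locations.
scanned = [key for key, part in partitions.items() if rank in part["location"]]

# With the shortcut: index directly into the partitions.
local_data = [partitions[key]["data"] for key in local_keys]
```

The scan touches every entry, while the shortcut touches only the local ones, which is where the saving comes from when the number of ranks is large.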
@@ -130,7 +130,7 @@ __partitioned_interface__ = {
130130
'data': future3,
131131
'location': ['1.1.1.2:55667'], },
132132
}
133-
'get': lambda x: x.result()
133+
'get': lambda x: distributed.get_client().gather(x)
134134
}
135135
```
136136
### 2d-structure (64 elements), 1d-partition-grid, 4 partitions on 2 ranks, row-block-cyclic distribution, partitions are of type `pandas.DataFrame`, MPI
