This repository was archived by the owner on Aug 19, 2024. It is now read-only.
Partitioned.md: +5 -5
@@ -1,7 +1,7 @@
**Draft**
# Motivation
- When operating in distributed memory systems, a data container (such as tensors or data-frames) may be partitioned into several smaller chunks which might be allocated in distinct (distributed) address spaces. An implementation of operations defined specifically for such a partitioned data container will likely need to use specifics of the structure to provide good performance. Most prominently, making use of the data locality can be vital. For a consumer of such a partitioned container it will be inconvenient (if not impossible) to write specific code for any possible partitioned and distributed data container. On the other hand, like with the array interface, it would be much better to have a standard way of extracting the metadata about the partitioned and distributed nature of a given container.
+ When operating in distributed memory systems, a data container (such as tensors or data-frames) may be partitioned into several smaller chunks which might be allocated in distinct (distributed) address spaces. An implementation of operations defined specifically for such a partitioned data container will likely need to use specifics of the structure to provide good performance. Most prominently, making use of the data locality can be vital. For a consumer of such a partitioned container it will be inconvenient (if not impossible) to write specific code for any possible partitioned and distributed data container. On the other hand, like with the array/dataframe API interface, it would be much better to have a standard way of extracting the metadata about the partitioned and distributed nature of a given container.
The goal of the `__partitioned__` protocol is to allow partitioned and distributed data containers to expose information to consumers so that unnecessary copying can be avoided as much as possible.
@@ -51,11 +51,11 @@ In addition to the above required keys a container is encouraged to provide more
* The actual data of the partition, potentially provided as a handle.
* All data/partitions must be of the same type.
* The actual partition type is opaque to the `__partitioned__` protocol.
- * The consumer needs to deal with it like it would in a non-partitioned case. For example, if the consumer can deal with arrays it could check for the existence of the array interface. Other conforming consumers could hard-code a very specific type check.
+ * The consumer needs to deal with it like it would in a non-partitioned case. For example, if the consumer can deal with arrays it could check for the existence of the array/dataframe APIs (https://github.com/data-apis/array-api, https://github.com/data-apis/dataframe-api). Other conforming consumers could hard-code a very specific type check.
* Whenever possible the data should be provided as references. References are mandatory for non-SPMD backends. This avoids unnecessary data movement.
* Ray: ray.ObjectRef
* Dask: dask.Future
- * It is recommended to access the actual data through the callable in the 'get' field of `__partitioned__`. This allows consumers to avoid differentiating between different handle types and type checks can be limited to basic types like pandas.DataFrame and numpy.ndarray.
+ * It is recommended to access the actual data through the callable in the 'get' field of `__partitioned__`. This allows consumers to avoid checking handle types and container types.
* For SPMD-MPI-like backends: partitions which are not locally available may be `None`. This is the recommended behavior unless the underlying backend supports references (such as promises/futures) to avoid unnecessary data movement.
* `location`
@@ -66,7 +66,7 @@ In addition to the above required keys a container is encouraged to provide more
* SPMD/MPI-like frameworks such as MPI, SHMEM etc.: rank
## `get`
- A callable that must return a raw data object when called with a handle (or sequence of handles) provided in the `data` field of an entry in `partitions`. Raw data objects are standard data structures like pandas.DataFrame and numpy.ndarray.
+ A callable that must return a raw data object when called with a handle (or sequence of handles) provided in the `data` field of an entry in `partitions`. Raw data objects are standard data structures: DataFrame or nd-array.
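
As a rough illustration of the `get` contract described above, the following sketch resolves either a single handle or a sequence of handles. `FakeHandle` is a hypothetical stand-in for a backend reference (such as a future), plain lists stand in for DataFrames or nd-arrays, and returning one raw object per handle for a sequence is an assumption rather than something the text spells out.

```python
from collections.abc import Sequence


class FakeHandle:
    """Hypothetical stand-in for a backend reference (e.g. a future)."""

    def __init__(self, payload):
        self._payload = payload

    def result(self):
        # A real backend would fetch or wait for the data here.
        return self._payload


def get(handles):
    """Resolve one handle, or a sequence of handles, to raw data."""
    if isinstance(handles, Sequence) and not isinstance(handles, (str, bytes)):
        # Assumption: a sequence of handles yields a list of raw objects.
        return [h.result() for h in handles]
    return handles.result()


# Only the keys mentioned in the text are used; lists mimic raw data.
parts = {
    "partitions": {
        (0,): {"data": FakeHandle([1, 2, 3])},
        (1,): {"data": FakeHandle([4, 5, 6])},
    },
    "get": get,
}

# The consumer never inspects the handle type; it only calls `get`.
first = parts["get"](parts["partitions"][(0,)]["data"])
both = parts["get"]([p["data"] for p in parts["partitions"].values()])
```

Because the consumer goes through `get` only, swapping `FakeHandle` for a real reference type would not change the consumer code in this sketch.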
## `locals`
This shortcut for SPMD environments allows processes/ranks to quickly extract their local partitions. It saves processes from parsing the `partitions` dictionary for the local rank/address, which is helpful when the number of ranks/processes/PEs is large.
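
To make the benefit of `locals` concrete, here is a minimal sketch comparing a direct lookup through `locals` with a full scan of `partitions`. It assumes that `locals` holds the keys of the local partitions and that `location` is a list of ranks; both are assumptions for illustration, as are the helper names `build_view`, `local_data_via_locals` and `local_data_via_scan`. Lists stand in for the real partition objects.

```python
def build_view(rank):
    """Build a toy `__partitioned__`-style dict as seen from one rank."""
    # Four partitions distributed block-cyclically over two ranks.
    owners = {(0,): 0, (1,): 1, (2,): 0, (3,): 1}
    partitions = {}
    for key, owner in owners.items():
        # Partitions that are not locally available are None.
        data = [key[0] * 10 + i for i in range(2)] if owner == rank else None
        partitions[key] = {"data": data, "location": [owner]}
    return {
        "partitions": partitions,
        "get": lambda handle: handle,  # data is already raw in this toy setup
        "locals": [key for key, owner in owners.items() if owner == rank],
    }


def local_data_via_locals(parts):
    """Use the shortcut: touch only the entries listed in `locals`."""
    get = parts["get"]
    return {key: get(parts["partitions"][key]["data"]) for key in parts["locals"]}


def local_data_via_scan(parts, rank):
    """Without the shortcut: inspect every entry in `partitions`."""
    get = parts["get"]
    return {
        key: get(entry["data"])
        for key, entry in parts["partitions"].items()
        if rank in entry["location"]
    }


view = build_view(rank=0)
fast = local_data_via_locals(view)
slow = local_data_via_scan(view, rank=0)
```

Both helpers return the same mapping here; the difference is that the scan visits every partition entry, while the `locals` lookup visits only the local ones.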
### 2d-structure (64 elements), 1d-partition-grid, 4 partitions on 2 ranks, row-block-cyclic distribution, partitions are of type `pandas.DataFrame`, MPI