
Add from_partitioned to create Dataset from any data source implements partitioned #18966

Closed
wants to merge 5 commits

Conversation

kira-lin
Contributor

Why are these changes needed?

We intend to propose a protocol to make exchanging large, distributed, partitioned data between frameworks (like Ray, Modin, Dask) easier. Several PRs are in progress; please check here.

It's also possible to do this for Ray Dataset, but to start with, MLDataset is simpler.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl
Contributor

ericl commented Sep 29, 2021

MLDataset is deprecated, so I don't think we should be accepting patches to it.

@kira-lin
Contributor Author

Yes, I can do this for Ray Dataset. I just want to discuss this protocol first. @fschlimb

@ericl
Contributor

ericl commented Sep 29, 2021

Sounds good. Btw, the integration point for Datasets would be to define a custom datasource (e.g., PartitionedDatasource or similar), via the datasource API: https://github.com/ray-project/ray/blob/master/python/ray/data/datasource/datasource.py
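For concreteness, here is a minimal sketch of what such a datasource could look like against the Ray 1.x Datasource/ReadTask interface linked above. The class name PartitionedDatasource, the data read argument, and the __partitioned__ dict layout are illustrative assumptions rather than existing Ray API, and the signatures may differ in newer Ray releases.

```python
# Sketch only: a custom datasource over the draft __partitioned__ protocol, written
# against the Ray 1.x Datasource/ReadTask interface. Names and dict layout are assumed.
from typing import List

import ray
from ray.data.block import BlockMetadata
from ray.data.datasource import Datasource, ReadTask


class PartitionedDatasource(Datasource):
    """Reads blocks from any object exposing the draft __partitioned__ protocol."""

    def prepare_read(self, parallelism: int, data=None) -> List[ReadTask]:
        read_tasks = []
        # Assumed layout: __partitioned__["partitions"] maps a partition id to a dict
        # holding the partition handle ("data") and its location.
        for part in data.__partitioned__["partitions"].values():
            obj_ref = part["data"]  # e.g. an ObjectRef already in the object store

            def read_fn(ref=obj_ref):
                # Resolve the partition into a block (Arrow table or pandas DataFrame).
                return ray.get(ref)

            metadata = BlockMetadata(
                num_rows=None, size_bytes=None, schema=None, input_files=None
            )
            read_tasks.append(ReadTask(read_fn, metadata))
        return read_tasks


# Usage sketch:
# ds = ray.data.read_datasource(PartitionedDatasource(), data=some_partitioned_object)
```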

@kira-lin
Contributor Author

kira-lin commented Oct 8, 2021

I did not use the Datasource API for a few reasons:

  1. data can already be in the Ray object store before from_partitioned is called
  2. we want to utilize the data's locality, given by location in the partitions.

@fschlimb

fschlimb commented Oct 8, 2021

I did not use the Datasource API for a few reasons:

  1. data can already be in the Ray object store before from_partitioned is called
  2. we want to utilize the data's locality, given by location in the partitions.

Why is that not possible with Datasource?

@fschlimb

fschlimb commented Oct 8, 2021

Broader discussion started here: data-apis/consortium-feedback#7

@ericl
Contributor

ericl commented Oct 8, 2021

Why is that not possible with Datasource?

I believe both of these are possible with Datasource. For locality, Ray will internally manage locality-aware execution; the use of node: labels is not recommended, since it interferes with auto-scaling and fault tolerance.

@kira-lin
Contributor Author

kira-lin commented Oct 9, 2021

The data could be in Dask, which makes things hard. Anyway, I'll update it once the protocol settles.

@ericl
Contributor

ericl commented Oct 15, 2021

If the data is in Dask on Ray, then locality scheduling will apply to those objects. We don't support Dask unless it's run via Dask on Ray.

Alternatively, we could modify the data source API to allow custom read tasks to be generated (e.g., a PartitionedDataSource could generate read tasks that run on specific nodes according to locality).

Can you let me know if one of the above alternatives works?
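As a rough illustration of the first alternative: if the Dask collection is executed via Dask on Ray, its partitions already live in Ray's object store, and Datasets' locality-aware scheduling applies when they are consumed. The parquet path below is a placeholder; ray.util.dask.ray_dask_get and ray.data.from_dask are existing APIs.

```python
# Illustrative flow for the "Dask on Ray" alternative; the input path is a placeholder.
import dask.dataframe as dd
import ray
from ray.util.dask import ray_dask_get

ray.init()

ddf = dd.read_parquet("example.parquet")  # placeholder input
# Materialize partitions with the Dask-on-Ray scheduler so they land in Ray's object store.
ddf = ddf.persist(scheduler=ray_dask_get)

# Ray Datasets can then consume the Dask partitions directly; locality is handled by Ray.
ds = ray.data.from_dask(ddf)
```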

@@ -511,6 +511,29 @@ def from_modin(df: "modin.DataFrame") -> Dataset[ArrowRow]:
     parts = unwrap_partitions(df, axis=0)
     return from_pandas(parts)
 
+def from_partitioned(data) -> Dataset[ArrowRow]:

This should be refactored into a PartitionedDataSource.

@ericl ericl changed the title Add from_partitioned to create MLDataset from any data source implements partitioned Add from_partitioned to create Dataset from any data source implements partitioned Oct 15, 2021
@ericl ericl added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Oct 15, 2021
@ericl ericl removed their assignment Oct 21, 2021
@fschlimb

fschlimb commented Oct 25, 2021

Why is that not possible with Datasource?

I believe both of these are possible with Datasource. For locality, Ray will internally manage locality-aware execution; the use of node: labels is not recommended, since it interferes with auto-scaling and fault tolerance.

Yes, this is generally understood, and as long as we consider "native" Ray objects only, this will work just fine without anything special. If we consider data which comes from somewhere else (say, Dask or YetAnotherFancyFramework), the protocol allows us to manually guarantee locality, probably using node: labels, if only for the tasks which put the data into Ray space. The protocol tries to allow this kind of interoperability without requiring consumers to necessarily support all frameworks.

Notice: Ray limits the possibilities for proper zero-copy when consuming non-Ray objects, but at least we can avoid data transfer between nodes.
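A rough sketch of the manual-locality workaround described here: for data not yet in Ray, a small ingest task can be pinned to the node that already holds each partition via a "node:<ip>" custom resource, so the copy into Ray's object store stays node-local. The __partitioned__ layout (a "partitions" dict whose entries carry a "data" handle and a "location" with the owning node's IP) is an assumption based on the draft protocol, and the handles are assumed to be picklable and resolvable on the target node.

```python
import ray


@ray.remote
def _ingest(handle):
    # Resolve a foreign partition handle into a concrete block on the node that owns it;
    # returning it places the block into that node's object store.
    return handle.result() if hasattr(handle, "result") else handle


def ingest_partitions_locally(partitioned_obj):
    """Pin one ingest task per partition to the node reported by the protocol."""
    refs = []
    for part in partitioned_obj.__partitioned__["partitions"].values():
        node_ip = part["location"][0]  # assumed: location lists the owning node's IP
        pinned = _ingest.options(resources={f"node:{node_ip}": 0.001})
        refs.append(pinned.remote(part["data"]))
    return refs
```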

@bveeramani
Member

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black:
     pip install -I black==21.12b0
  2. Format changed files with Black:
     curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
     chmod +x ./format-changed.sh
     ./format-changed.sh
     rm format-changed.sh
  3. Commit your changes:
     git add --all
     git commit -m "Format Python code with Black"
  4. Merge master into your branch:
     git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@stale

stale bot commented Mar 18, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 18, 2022
@stale

stale bot commented Apr 3, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this Apr 3, 2022