
Add from_partitioned to create Dataset from any data source implements partitioned #18966

Closed
wants to merge 5 commits

Conversation

kira-lin
Contributor

Why are these changes needed?

We intend to propose a protocol to make exchanging large, distributed, partitioned data between frameworks (like Ray, Modin, Dask) easier. Several PRs are in progress; please check here.

It's also possible to do this for Ray Dataset, but to start with, MLDataset is simpler.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl
Contributor

ericl commented Sep 29, 2021

MLDataset is deprecated, so I don't think we should be accepting patches to it.

@kira-lin
Contributor Author

Yes, I can do this for Ray Dataset. I just want to discuss this protocol first. @fschlimb

@ericl
Contributor

ericl commented Sep 29, 2021

Sounds good. Btw, the integration point for Datasets would be to define a custom datasource (e.g., PartitionedDatasource or similar), via the datasource API: https://github.com/ray-project/ray/blob/master/python/ray/data/datasource/datasource.py
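For concreteness, here is a minimal sketch of what such a datasource could look like against the Ray 1.x Datasource/ReadTask interface linked above. The class name PartitionedDatasource, the data read argument, and the __partitioned__ dict layout are illustrative assumptions rather than existing Ray API, and the signatures may differ in newer Ray releases.

```python
# Sketch only: a custom datasource over the draft __partitioned__ protocol, written
# against the Ray 1.x Datasource/ReadTask interface. Names and dict layout are assumed.
from typing import List

import ray
from ray.data.block import BlockMetadata
from ray.data.datasource import Datasource, ReadTask


class PartitionedDatasource(Datasource):
    """Reads blocks from any object exposing the draft __partitioned__ protocol."""

    def prepare_read(self, parallelism: int, data=None) -> List[ReadTask]:
        read_tasks = []
        # Assumed layout: __partitioned__["partitions"] maps a partition id to a dict
        # holding the partition handle ("data") and its location.
        for part in data.__partitioned__["partitions"].values():
            obj_ref = part["data"]  # e.g. an ObjectRef already in the object store

            def read_fn(ref=obj_ref):
                # Resolve the partition into a block (Arrow table or pandas DataFrame).
                return ray.get(ref)

            metadata = BlockMetadata(
                num_rows=None, size_bytes=None, schema=None, input_files=None
            )
            read_tasks.append(ReadTask(read_fn, metadata))
        return read_tasks


# Usage sketch:
# ds = ray.data.read_datasource(PartitionedDatasource(), data=some_partitioned_object)
```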

@kira-lin
Contributor Author

kira-lin commented Oct 8, 2021

I did not use the Datasource API for a few reasons:

  1. data can already be in the Ray object store before from_partitioned is called
  2. we want to utilize the data's locality, given by location in the partitions.

@fschlimb

fschlimb commented Oct 8, 2021

I did not use the Datasource API for a few reasons:

  1. data can already be in the Ray object store before from_partitioned is called
  2. we want to utilize the data's locality, given by location in the partitions.

Why is that not possible with Datasource?

@fschlimb

fschlimb commented Oct 8, 2021

Broader discussion started here: data-apis/consortium-feedback#7

@ericl
Contributor

ericl commented Oct 8, 2021

Why is that not possible with Datasource?

I believe both of these are possible with Datasource. For locality, Ray will internally manage locality-aware execution; the use of node: labels is not recommended, since it interferes with auto-scaling and fault tolerance.

@kira-lin
Contributor Author

kira-lin commented Oct 9, 2021

The data could be in Dask, which makes things hard. Anyway, I'll update it once the protocol settles.

@ericl
Contributor

ericl commented Oct 15, 2021

If the data is in Dask on Ray, then locality scheduling will apply to those objects. We don't support Dask unless it's run via Dask on Ray.

Alternatively, we could modify the data source API to allow custom read tasks to be generated (e.g., a PartitionedDataSource could generate read tasks that run on specific nodes according to locality).

Can you let me know if one of the above alternatives works?
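As a rough illustration of the first alternative: if the Dask collection is executed via Dask on Ray, its partitions already live in Ray's object store, and Datasets' locality-aware scheduling applies when they are consumed. The parquet path below is a placeholder; ray.util.dask.ray_dask_get and ray.data.from_dask are existing APIs.

```python
# Illustrative flow for the "Dask on Ray" alternative; the input path is a placeholder.
import dask.dataframe as dd
import ray
from ray.util.dask import ray_dask_get

ray.init()

ddf = dd.read_parquet("example.parquet")  # placeholder input
# Materialize partitions with the Dask-on-Ray scheduler so they land in Ray's object store.
ddf = ddf.persist(scheduler=ray_dask_get)

# Ray Datasets can then consume the Dask partitions directly; locality is handled by Ray.
ds = ray.data.from_dask(ddf)
```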

@@ -511,6 +511,29 @@ def from_modin(df: "modin.DataFrame") -> Dataset[ArrowRow]:
     parts = unwrap_partitions(df, axis=0)
     return from_pandas(parts)
 
+def from_partitioned(data) -> Dataset[ArrowRow]:

This should be refactored into a PartitionedDataSource.

@ericl ericl changed the title Add from_partitioned to create MLDataset from any data source implements partitioned Add from_partitioned to create Dataset from any data source implements partitioned Oct 15, 2021
@ericl ericl added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Oct 15, 2021
@ericl ericl removed their assignment Oct 21, 2021
@fschlimb

fschlimb commented Oct 25, 2021

Why is that not possible with Datasource?

I believe both of these are possible with Datasource. For locality, Ray will internally manage locality-aware execution; the use of node: labels is not recommended, since it interferes with auto-scaling and fault tolerance.

Yes, this is generally understood, and as long as we consider "native" Ray objects only, this will work just fine without anything special. If we consider data which comes from somewhere else (say, Dask or YetAnotherFancyFramework), the protocol allows us to manually guarantee locality, probably using node: labels, if only for the tasks which put the data into Ray space. The protocol tries to allow this kind of interoperability without requiring consumers to necessarily support all frameworks.

Notice: Ray limits the possibilities for proper zero-copy when consuming non-Ray objects, but at least we can avoid data transfer between nodes.
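A rough sketch of the manual-locality workaround described here: for data not yet in Ray, a small ingest task can be pinned to the node that already holds each partition via a "node:<ip>" custom resource, so the copy into Ray's object store stays node-local. The __partitioned__ layout (a "partitions" dict whose entries carry a "data" handle and a "location" with the owning node's IP) is an assumption based on the draft protocol, and the handles are assumed to be picklable and resolvable on the target node.

```python
import ray


@ray.remote
def _ingest(handle):
    # Resolve a foreign partition handle into a concrete block on the node that owns it;
    # returning it places the block into that node's object store.
    return handle.result() if hasattr(handle, "result") else handle


def ingest_partitions_locally(partitioned_obj):
    """Pin one ingest task per partition to the node reported by the protocol."""
    refs = []
    for part in partitioned_obj.__partitioned__["partitions"].values():
        node_ip = part["location"][0]  # assumed: location lists the owning node's IP
        pinned = _ingest.options(resources={f"node:{node_ip}": 0.001})
        refs.append(pinned.remote(part["data"]))
    return refs
```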

@bveeramani
Member

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black:
     pip install -I black==21.12b0
  2. Format changed files with Black:
     curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
     chmod +x ./format-changed.sh
     ./format-changed.sh
     rm format-changed.sh
  3. Commit your changes:
     git add --all
     git commit -m "Format Python code with Black"
  4. Merge master into your branch:
     git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@stale

stale bot commented Mar 18, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 18, 2022
@stale

stale bot commented Apr 3, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this Apr 3, 2022