Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#3451: Support __partitioned__ protocol #3452

Closed

Conversation

YarShev
Copy link
Collaborator

@YarShev YarShev commented Sep 21, 2021

Signed-off-by: Igoshev, Yaroslav [email protected]

What do these changes do?

  • commit message follows format outlined here
  • passes flake8 modin
  • passes black --check modin
  • signed commit with git commit -s
  • Resolves Support __partitioned__ protocol #3451
  • tests passing
  • module layout described at docs/developer/architecture.rst is up-to-date

@YarShev YarShev requested a review from a team as a code owner September 21, 2021 12:03
@codecov
Copy link

codecov bot commented Sep 21, 2021

Codecov Report

Merging #3452 (e460612) into master (7727c23) will decrease coverage by 34.44%.
The diff coverage is 27.57%.

❗ Current head e460612 differs from pull request most recent head 1caf113. Consider uploading reports for the commit 1caf113 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           master    #3452       +/-   ##
===========================================
- Coverage   83.23%   48.79%   -34.45%     
===========================================
  Files         147      144        -3     
  Lines       15246    15529      +283     
===========================================
- Hits        12690     7577     -5113     
- Misses       2556     7952     +5396     
Impacted Files Coverage Δ
modin/backends/base/query_compiler.py 62.47% <ø> (-36.81%) ⬇️
modin/conftest.py 55.70% <ø> (-25.88%) ⬇️
modin/distributed/dataframe/pandas/partitions.py 0.00% <0.00%> (-24.40%) ⬇️
...ngines/base/io/column_stores/parquet_dispatcher.py 21.21% <0.00%> (-68.62%) ⬇️
modin/engines/base/io/text/csv_glob_dispatcher.py 10.62% <0.00%> (-67.88%) ⬇️
modin/engines/base/io/text/excel_dispatcher.py 9.16% <0.00%> (-82.18%) ⬇️
modin/engines/base/io/text/fwf_dispatcher.py 9.09% <0.00%> (-78.09%) ⬇️
modin/engines/base/io/text/json_dispatcher.py 20.45% <0.00%> (-75.70%) ⬇️
...in/experimental/backends/omnisci/query_compiler.py 0.00% <0.00%> (-89.33%) ⬇️
modin/experimental/cloud/omnisci.py 0.00% <0.00%> (ø)
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7727c23...1caf113. Read the comment docs.

Copy link
Collaborator

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a high priority? What consumers of this type of protocol already exist?

@Garra1980
Copy link
Collaborator

yeah, let's not consider this for 0.11

@fschlimb
Copy link

@devin-petersohn You're raising a chicken-and-egg issue.

Generally, adding support as a consumer is more involved but at the same time adds more value. Adding this at the producer side is less involved. Like here it is very isolated - it has no effect on anything else.

Current Implementation status is

More to come, but we need producers! I'd appreciate seeing modin as a producer.

To add more value for modin, we can add support for consuming this API. A simple "from_partitioned" would provide the basic functionality. "Automatic" detection when accepting data at construction time would be nice to have and could be added later.

@YarShev
Copy link
Collaborator Author

YarShev commented Sep 27, 2021

To add more value for modin, we can add support for consuming this API. A simple "from_partitioned" would provide the basic functionality. "Automatic" detection when accepting data at construction time would be nice to have and could be added later.

It would also help to match partitions of the left object (Modin DF/Series) and partitions of the right object (an instance supporting __partitioned__) when performing operations that may require right object(s).

@devin-petersohn
Copy link
Collaborator

Generally, adding support as a consumer is more involved but at the same time adds more value. Adding this at the producer side is less involved. Like here it is very isolated - it has no effect on anything else.

It's not a big deal to add something, but I don't want code bloat for things that won't be useful. I think it's a good idea to have a protocol like this, but who would use it? What library has said that they actually need an interface like this to be able to work with Modin? We at least need to know if it would be useful to a meaningful consumer before we add it.

Typically these protocols have significant amounts of input from multiple interested parties (producers and consumers) to answer these questions before they are even designed. I just want to make sure we aren't adding something that won't be used. I don't deny the usefulness of this protocol, but whether or not it will be used is still not answered to me.

@fschlimb
Copy link

fschlimb commented Sep 28, 2021

Typically these protocols have significant amounts of input from multiple interested parties (producers and consumers) to answer these questions before they are even designed.

Yes, we are open to feedback and suggestions for different designs and features. As mentioned above, your feedback/suggestions is/are highly appreciated. Talking about something concrete is usually easier than keeping discussions in the abstract.

I just want to make sure we aren't adding something that won't be used. I don't deny the usefulness of this protocol, but whether or not it will be used is still not answered to me.

That's of course a valid request.

One issue in the process of ramping this up is that we are targeting something that is currently less of a concern but we want to make sure we can avoid running into a situation where implementations of distributed features become messy.

It's not a big deal to add something, but I don't want code bloat for things that won't be useful. I think it's a good idea to have a protocol like this, but who would use it? What library has said that they actually need an interface like this to be able to work with Modin? We at least need to know if it would be useful to a meaningful consumer before we add it.

The idea is to allow a packages to consume various distributed containers. It is not meant to be an enabler for modin specifically. Of course packages like xgboost_ray can support modin/RayDataSetl/MLDataSet/HeAT/... by implementing dedicated code for each. Avoiding such specialization for each (in particular upcoming) structure is the major motivation for this.

@YarShev
Copy link
Collaborator Author

YarShev commented Oct 7, 2021

A new discussion was initiated with the data-API consortium: data-apis/consortium-feedback#7

@vnlitvinov
Copy link
Collaborator

@YarShev should we mark this as draft then?

@YarShev
Copy link
Collaborator Author

YarShev commented Nov 1, 2021

@vnlitvinov , yes, we can mark the PR as a draft for now. I'll update it once the protocol settles.

@YarShev YarShev marked this pull request as draft November 1, 2021 07:11
@YarShev
Copy link
Collaborator Author

YarShev commented Jan 12, 2024

Closing this PR as no plans on pushing this forward.

@YarShev YarShev closed this Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support __partitioned__ protocol
5 participants