feat: add ray repartition pipeline #985

Open
macroguo-ghy wants to merge 2 commits into datajuicer:main from macroguo-ghy:feat/repartition-mapper

@macroguo-ghy
Contributor

@macroguo-ghy macroguo-ghy commented May 25, 2026

Summary

  • add a Ray-only repartition pipeline for dataset-level block repartitioning
  • register the pipeline and list it in config_all.yaml and operator docs
  • add unit coverage for repartition options, local-executor failure, and load_ops registration
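For readers unfamiliar with block repartitioning, here is a minimal plain-Python sketch of the idea. It is not the code in this PR and it does not use Ray; the function name `repartition_blocks` and the even-split policy are assumptions made for illustration. In Ray itself the corresponding call is `Dataset.repartition(num_blocks)`.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def repartition_blocks(blocks: Sequence[Sequence[T]], num_blocks: int) -> List[List[T]]:
    """Redistribute rows from any number of input blocks into `num_blocks` blocks.

    Hypothetical helper: row order is preserved and output block sizes
    differ by at most one row.
    """
    if num_blocks <= 0:
        raise ValueError("num_blocks must be a positive integer")

    # Flatten all input blocks into one ordered list of rows.
    rows = [row for block in blocks for row in block]

    # The first `extra` output blocks receive one additional row each.
    base, extra = divmod(len(rows), num_blocks)
    out: List[List[T]] = []
    start = 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)
        out.append(rows[start:start + size])
        start += size
    return out


# Three skewed blocks (1, 5, and 1 rows) rebalanced into three blocks.
uneven = [[1], [2, 3, 4, 5, 6], [7]]
even = repartition_blocks(uneven, 3)
```

The point of the sketch is that the operation needs every block at once, which is why it is a dataset-level step rather than a per-sample transformation.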

Validation

  • pre-commit run --files data_juicer/ops/pipeline/repartition_pipeline.py data_juicer/ops/pipeline/__init__.py data_juicer/ops/mapper/__init__.py data_juicer/config/config_all.yaml docs/Operators.md docs/operators/pipeline/repartition_pipeline.md tests/ops/pipeline/test_repartition_pipeline.py
  • python3 -m pytest tests/ops/pipeline/test_repartition_pipeline.py
  • python3 -m py_compile data_juicer/ops/pipeline/repartition_pipeline.py tests/ops/pipeline/test_repartition_pipeline.py && git diff --check
  • load_ops registration smoke check

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the repartition_mapper operator, which allows for repartitioning Ray Datasets into a target number of blocks. The implementation includes the operator logic, configuration updates, documentation, and unit tests. Review feedback consistently points out that since the operator inherits from the Pipeline class and performs dataset-level transformations, it should be renamed to RepartitionPipeline and reclassified from a mapper to a pipeline across the codebase and documentation to maintain architectural consistency.
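To illustrate the mapper versus pipeline distinction raised in this review, here is a rough sketch. Every name in it (`ToyDataset`, `run_mapper`, `run_repartition`, the `distributed` flag) is hypothetical and is not taken from the Data-Juicer or Ray code base. It only shows why a per-sample mapper cannot change the block layout while a dataset-level op can, and how such an op might refuse to run on a non-distributed dataset.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a block-partitioned dataset (not a Ray class)."""
    blocks: List[List[dict]]
    distributed: bool = True  # False mimics a local, non-Ray dataset


def run_mapper(ds: ToyDataset, fn: Callable[[dict], dict]) -> ToyDataset:
    """Mapper-style op: transforms samples one at a time; block layout is untouched."""
    return ToyDataset([[fn(s) for s in block] for block in ds.blocks], ds.distributed)


def run_repartition(ds: ToyDataset, num_blocks: int) -> ToyDataset:
    """Pipeline-style op: works on the whole dataset because it changes the block layout."""
    if not ds.distributed:
        # Assumed behavior for the sketch: fail loudly instead of silently doing nothing.
        raise RuntimeError("repartition is only supported for distributed datasets")
    rows = [s for block in ds.blocks for s in block]
    # Round-robin assignment keeps the sketch short; real engines may split differently.
    new_blocks = [rows[i::num_blocks] for i in range(num_blocks)]
    return ToyDataset(new_blocks, ds.distributed)


ds = ToyDataset([[{"text": "a"}, {"text": "b"}, {"text": "c"}], [{"text": "d"}]])
mapped = run_mapper(ds, lambda s: {"text": s["text"].upper()})
reparted = run_repartition(ds, 2)
```

In this toy version `mapped` keeps the original 3 + 1 block layout, while `reparted` ends up with two blocks of two samples each.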

Comment thread data_juicer/ops/mapper/repartition_mapper.py Outdated
Comment thread docs/Operators.md Outdated
Comment thread docs/Operators.md Outdated
Comment thread docs/operators/mapper/repartition_mapper.md Outdated
@macroguo-ghy macroguo-ghy force-pushed the feat/repartition-mapper branch from cbfbbc5 to 4410e80 on May 25, 2026 09:54
@macroguo-ghy macroguo-ghy changed the title from "feat: add repartition mapper" to "feat: add repartition pipeline" on May 25, 2026
@macroguo-ghy macroguo-ghy marked this pull request as ready for review May 25, 2026 11:19
@cmgzn
Collaborator

cmgzn commented May 26, 2026

This is a very useful new op, thanks! The implementation looks good to me. One minor suggestion: since it is Ray-only, maybe rename it to ray_repartition_pipeline for consistency with other Ray-only ops. If you prefer, we can also make this adjustment on our side.

@macroguo-ghy
Contributor Author

Thanks for the suggestion! Renamed it to ray_repartition_pipeline and updated the code, config, docs, and tests accordingly.

@macroguo-ghy macroguo-ghy changed the title from "feat: add repartition pipeline" to "feat: add ray repartition pipeline" on May 26, 2026
@cmgzn cmgzn requested a review from fengrui-z May 27, 2026 03:33

2 participants