feat(experimental): integrate Ray RDT for weight syncing #1305

Open
KaisennHu wants to merge 1 commit into areal-project:main from KaisennHu:feat/integrate-rdt

KaisennHu commented May 6, 2026

Description

This PR implements the RDT (Ray Direct Transport) backend for weight syncing.

Core changes:

  • IW (inference worker) Scheduler Bridge (rdt_scheduler.py): TransferPlan shard selection + Ray RPC weight pull
  • TW (training worker) Adapter (rdt/fsdp_adapter.py): FSDP weight metadata extraction + actor handle serialization
  • HTTP Endpoints: TW Flask blueprint + IW FastAPI endpoints
  • Gateway (app.py): RDT mode /connect and /update_weights flow
  • Tensor Transport (ray_rpc_server.py): @ray.method(tensor_transport="YR"|"NIXL") decorated methods
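
As a rough illustration of the tensor transport choice named in the last bullet, the sketch below maps a device type to the transport string that would be passed to `@ray.method(tensor_transport=...)`. Only the two transport names and their device pairing come from this description; the `Device` enum and `select_tensor_transport` helper are hypothetical and are not taken from ray_rpc_server.py.

```python
from enum import Enum


class Device(Enum):
    """Hypothetical device kinds, used only for this illustration."""
    NPU = "npu"
    GPU = "gpu"


# Pairing stated in the PR description: YR for NPU, NIXL for GPU.
_TRANSPORT_BY_DEVICE = {Device.NPU: "YR", Device.GPU: "NIXL"}


def select_tensor_transport(device: str) -> str:
    """Return the transport name for a device string such as "npu" or "GPU".

    The result is the value one would pass as `tensor_transport` to the
    decorator mentioned above; this helper itself never touches Ray.
    """
    try:
        return _TRANSPORT_BY_DEVICE[Device(device.lower())]
    except ValueError:
        # Device(...) raises ValueError for unknown kinds; re-raise with context.
        raise ValueError(f"unsupported device for RDT: {device!r}") from None
```

Keeping the mapping in one table means an unsupported device fails early with a clear error instead of reaching the transport layer.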

Key features:

  • Supports YR (NPU) and NIXL (GPU) one-sided tensor transport
  • Independent implementation (RDT prefix), no coupling with awex classes
  • Uses TransferPlan.inter_operations for correct TW shard selection
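
The shard selection mentioned in the last bullet could look roughly like the sketch below, in which an inference worker filters the plan for the operations addressed to it and groups them by the training worker that holds each shard. Only the attribute name `inter_operations` appears in this description; the `TransferOp` fields, the simplified `TransferPlan` dataclass, and `shards_to_pull` are assumptions made for illustration and may not match the real classes.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransferOp:
    """Hypothetical record: one shard moving from a TW rank to an IW rank."""
    param_name: str
    src_rank: int  # training-worker rank that holds the shard
    dst_rank: int  # inference-worker rank that needs the shard


@dataclass
class TransferPlan:
    # Only the name `inter_operations` comes from the PR text; its shape is assumed.
    inter_operations: list[TransferOp] = field(default_factory=list)


def shards_to_pull(plan: TransferPlan, iw_rank: int) -> dict[int, list[str]]:
    """Group the parameters this IW rank must pull by the TW rank that owns them."""
    pulls: dict[int, list[str]] = defaultdict(list)
    for op in plan.inter_operations:
        if op.dst_rank == iw_rank:
            pulls[op.src_rank].append(op.param_name)
    return dict(pulls)
```

Grouping by source rank lets each inference worker issue one pull per training worker rather than one per parameter.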

Related Issue

Fixes #1243

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the Ray Direct Transport (RDT) weight update backend, which utilizes one-sided RDMA (YR for NPU, NIXL for GPU) for weight synchronization between training and inference workers. Key additions include new HTTP endpoints for both services, a scheduler bridge for the inference service, and an FSDP adapter for the training service. Feedback highlights a potential bug in response handling within the gateway, opportunities to reduce code duplication in parameter unfusing logic, and a suggestion to make an internal dispatch method private to prevent API misuse.
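
One way to address the duplication the review points at in the parameter unfusing logic is a single table-driven helper, sketched below. The fused and split parameter names (`qkv_proj`, `gate_up_proj`, and their parts) and the `unfuse` function are assumptions chosen for illustration; the actual names and tensor types in rdt_scheduler.py are not shown on this page. Plain Python sequences stand in for weight rows.

```python
from typing import Sequence

# Hypothetical layouts: fused name fragment -> ordered sub-parameter fragments.
FUSED_LAYOUTS = {
    "qkv_proj": ("q_proj", "k_proj", "v_proj"),
    "gate_up_proj": ("gate_proj", "up_proj"),
}


def unfuse(name: str, rows: Sequence, sizes: Sequence[int]) -> dict:
    """Split a fused parameter into named parts along its first dimension.

    `sizes` gives the row count of each part, in layout order. Names that
    match no fused layout are returned unchanged under their own name.
    """
    for fused, parts in FUSED_LAYOUTS.items():
        if fused in name:
            if len(sizes) != len(parts) or sum(sizes) != len(rows):
                raise ValueError(
                    f"{name}: sizes {list(sizes)} do not match "
                    f"{len(parts)} parts over {len(rows)} rows"
                )
            out, start = {}, 0
            for part, size in zip(parts, sizes):
                out[name.replace(fused, part)] = rows[start:start + size]
                start += size
            return out
    return {name: rows}
```

With one helper, adding a new fused layout becomes a one-line table entry instead of another copy of the slicing loop.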

Comment thread areal/experimental/weight_update/gateway/app.py
Comment thread areal/experimental/inference_service/sglang/rdt_scheduler.py Outdated
Comment thread areal/infra/rpc/ray_rpc_server.py Outdated
KaisennHu force-pushed the feat/integrate-rdt branch 2 times, most recently from 770271d to d3547aa (May 10, 2026 16:10)
KaisennHu force-pushed the feat/integrate-rdt branch 13 times, most recently from 90c2575 to 494dc8b (May 13, 2026 02:13)
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
KaisennHu force-pushed the feat/integrate-rdt branch from 494dc8b to ecaaa37 (May 13, 2026 03:41)
Development

Successfully merging this pull request may close these issues.

[RFC] Integrate Ray Core RDT for Weight Syncing
