
feat(archon): add colocated CUDA IPC weight transfer for awex #1310

Open
garrett4wade wants to merge 2 commits into main from fw/awex-colocate

Conversation

@garrett4wade
Collaborator

Description

Add a colocated weight update mode in which Megatron training and SGLang inference share the same GPUs. Weight transfer uses CUDA IPC (zero-copy on the same device) instead of NCCL P2P across devices.
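As a rough illustration of the underlying mechanism (the helper names here are hypothetical, not the PR's code), PyTorch's `torch.multiprocessing.reductions.reduce_tensor` turns a CUDA tensor into a picklable IPC descriptor that a second process on the same device can rebuild as a zero-copy view:

```python
import pickle

import torch
from torch.multiprocessing.reductions import reduce_tensor


def export_ipc_handle(weight: torch.Tensor) -> bytes:
    """Training side: serialize a CUDA IPC descriptor for one weight tensor."""
    assert weight.is_cuda, "CUDA IPC only applies to device tensors"
    rebuild_fn, rebuild_args = reduce_tensor(weight)
    return pickle.dumps((rebuild_fn, rebuild_args))


def import_ipc_handle(blob: bytes) -> torch.Tensor:
    """Inference side (a different process on the same GPU): map the shared memory."""
    rebuild_fn, rebuild_args = pickle.loads(blob)
    return rebuild_fn(*rebuild_args)  # aliases the sender's memory; no copy
```

Because both processes sit on the same device, the rebuilt tensor aliases the training process's memory, which is why no device-to-device copy is needed.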

Related Issue

N/A (new feature on fw/awex-colocate branch)

Type of Change

  • ✨ New feature

Key Changes

  • Gateway: Add colocate=True connect mode with find_free_ports for NCCL group initialization
  • Megatron adapter: Implement execute_colocate_weight_update (serialize weights via IPC, put to KV store, poll for inference done signal), release_memory/resume_memory with CPU offload for optimizer states and model weights
  • SGLang adapter: Implement execute_colocate_weight_update (fetch IPC weights from KV store, apply via NcclColocateStreamBatchTransport), release_memory/resume_memory with tag tracking
  • Protocol: Add colocate methods to the AwexTrainingAdapter and AwexInferenceAdapter protocols (see the sketch after this list)
  • Worker endpoints: Expose init_colocate_weight_update, execute_colocate_weight_update, release_memory, resume_memory on both training (Flask) and inference (FastAPI) services
  • Robustness: Handle offloaded weights before all_gather in execute_colocate_weight_update; track _released_tags to prevent double-release/resume
  • Logging: Suppress werkzeug HTTP request logs; lower inference controller service log level
  • Tests: Add colocated integration tests (single-version parametrized 2/4/8 GPU + multi-version sequential)
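
A hedged sketch of the adapter surface described above; the method names come from the bullet list, but the signatures and the mixin are illustrative, not the actual code under areal/experimental/weight_update/awex/:

```python
from typing import Protocol, Set


class ColocateAdapter(Protocol):
    """Shape of the colocate methods added to AwexTrainingAdapter / AwexInferenceAdapter."""

    def init_colocate_weight_update(self, master_port: int) -> None: ...
    def execute_colocate_weight_update(self) -> None: ...
    def release_memory(self, tag: str) -> None: ...
    def resume_memory(self, tag: str) -> None: ...


class TagTrackingMixin:
    """Illustrates the _released_tags bookkeeping that makes release/resume idempotent."""

    def __init__(self) -> None:
        self._released_tags: Set[str] = set()

    def release_memory(self, tag: str) -> None:
        if tag in self._released_tags:
            return  # already released; avoid double-release
        self._do_release(tag)  # e.g. offload optimizer state or weights to CPU
        self._released_tags.add(tag)

    def resume_memory(self, tag: str) -> None:
        if tag not in self._released_tags:
            return  # never released; nothing to resume
        self._do_resume(tag)  # reload onto the GPU
        self._released_tags.discard(tag)

    def _do_release(self, tag: str) -> None:
        raise NotImplementedError

    def _do_resume(self, tag: str) -> None:
        raise NotImplementedError
```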

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Additional Context

  • Only pure DP (TP=1, PP=1, EP=1) is supported for colocated mode
  • Weight transfer uses CUDA IPC handles serialized through the gateway KV store
  • The gateway orchestrates: release optimizer → resume inference weights (no-op) → execute transfer concurrently → release training weights → resume KV cache (see the sketch after this list)
  • master_port is explicitly required (no default) to avoid port conflicts
  • KV store keys use transfer_rank instead of ip+device_id to avoid CUDA_VISIBLE_DEVICES aliasing issues
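
A minimal sketch of that orchestration order, assuming hypothetical training/inference client objects that expose the worker endpoints listed under Key Changes (the real gateway code may differ):

```python
from concurrent.futures import ThreadPoolExecutor


def colocate_weight_update(training, inference) -> None:
    # 1. Release optimizer state so the IPC transfer has GPU headroom.
    training.release_memory(tag="optimizer")

    # 2. Resume inference weights (a no-op in colocated mode, kept for API symmetry).
    inference.resume_memory(tag="weights")

    # 3. Execute the transfer concurrently: training publishes IPC handles to the
    #    KV store and polls for the done signal; inference fetches and applies them.
    with ThreadPoolExecutor(max_workers=2) as pool:
        train_fut = pool.submit(training.execute_colocate_weight_update)
        infer_fut = pool.submit(inference.execute_colocate_weight_update)
        train_fut.result()
        infer_fut.result()

    # 4. Release training weights, then bring the KV cache back onto the GPU.
    training.release_memory(tag="weights")
    inference.resume_memory(tag="kv_cache")
```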

Implement colocated weight update mode where Megatron training and
SGLang inference share the same GPUs. Uses CUDA IPC (zero-copy on same
device) instead of NCCL P2P across devices for weight transfer.

Key changes:
- Add colocate mode to gateway with find_free_ports for NCCL group
- Implement execute_colocate_weight_update in both adapters
- Add release_memory/resume_memory with CPU offload for optimizer/weights
- Track released tags to prevent double-release and resume of unreleased tags
- Handle offloaded weights in execute_colocate_weight_update (reload before all_gather; see the sketch below)
- Suppress werkzeug HTTP request logs in Guard
- Add colocate integration tests (single + multi-version)

Refs: awex-colocate branch
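
A hedged illustration of the "reload before all_gather" guard mentioned above; apart from torch.distributed.all_gather, the helper and its shape are assumptions:

```python
import torch
import torch.distributed as dist


def gather_full_param(shard: torch.Tensor, world_size: int) -> torch.Tensor:
    # If an earlier release_memory() offloaded this shard to CPU, move it back
    # first: the NCCL collective below requires CUDA tensors.
    if not shard.is_cuda:
        shard = shard.cuda()

    buffers = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(buffers, shard)  # assumes the process group is initialized
    return torch.cat(buffers, dim=0)
```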

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a colocated weight update mode where training and inference share the same GPUs, utilizing CUDA IPC for zero-copy weight transfers. The changes include new API endpoints, adapter implementations for Megatron and SGLang to handle memory offloading (optimizer states, model weights, and KV cache), and gateway orchestration for the transfer process. Feedback identifies critical issues in the memory offloading implementation where tensors were not explicitly moved to CPU, leaving GPU memory occupied. Additionally, a hardcoded security key was flagged, and improvements were suggested for managing HTTP client resources using context managers.
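
For context on the offloading issue flagged here, a sketch of the explicit move-to-CPU pattern the review is asking for (the dict layout is illustrative, not the PR's code):

```python
import torch


def offload_state_to_cpu(state: dict) -> None:
    """Copy CUDA tensors to host memory and drop the GPU copies."""
    for key, value in state.items():
        if torch.is_tensor(value) and value.is_cuda:
            state[key] = value.to("cpu")  # keep the data on the host
    # The old GPU tensors are now unreferenced; release cached blocks so the
    # colocated inference engine can actually use the freed memory.
    torch.cuda.empty_cache()
```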

Outdated review comment threads: areal/experimental/weight_update/awex/megatron_adapter.py (4 threads), areal/experimental/weight_update/awex/sglang_adapter.py (1 thread)