
feat(archon): add colocated CUDA IPC weight transfer for awex #1310

Open
garrett4wade wants to merge 2 commits into main from fw/awex-colocate

Conversation

@garrett4wade
Collaborator

Description

Add a colocated weight update mode in which Megatron training and SGLang inference share the same GPUs. Weight transfer uses CUDA IPC (zero-copy on the same device) instead of NCCL P2P across devices.
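As a rough illustration of the underlying mechanism (the helper names here are hypothetical, not the PR's code), PyTorch's `torch.multiprocessing.reductions.reduce_tensor` turns a CUDA tensor into a picklable IPC descriptor that a second process on the same device can rebuild as a zero-copy view:

```python
import pickle

import torch
from torch.multiprocessing.reductions import reduce_tensor


def export_ipc_handle(weight: torch.Tensor) -> bytes:
    """Training side: serialize a CUDA IPC descriptor for one weight tensor."""
    assert weight.is_cuda, "CUDA IPC only applies to device tensors"
    rebuild_fn, rebuild_args = reduce_tensor(weight)
    return pickle.dumps((rebuild_fn, rebuild_args))


def import_ipc_handle(blob: bytes) -> torch.Tensor:
    """Inference side (a different process on the same GPU): map the shared memory."""
    rebuild_fn, rebuild_args = pickle.loads(blob)
    return rebuild_fn(*rebuild_args)  # aliases the sender's memory; no copy
```

Because both processes sit on the same device, the rebuilt tensor aliases the training process's memory, which is why no device-to-device copy is needed.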

Related Issue

N/A (new feature on fw/awex-colocate branch)

Type of Change

  • ✨ New feature

Key Changes

  • Gateway: Add colocate=True connect mode with find_free_ports for NCCL group initialization
  • Megatron adapter: Implement execute_colocate_weight_update (serialize weights via IPC, put to KV store, poll for inference done signal), release_memory/resume_memory with CPU offload for optimizer states and model weights
  • SGLang adapter: Implement execute_colocate_weight_update (fetch IPC weights from KV store, apply via NcclColocateStreamBatchTransport), release_memory/resume_memory with tag tracking
  • Protocol: Add colocate methods to the AwexTrainingAdapter and AwexInferenceAdapter protocols (see the sketch after this list)
  • Worker endpoints: Expose init_colocate_weight_update, execute_colocate_weight_update, release_memory, resume_memory on both training (Flask) and inference (FastAPI) services
  • Robustness: Handle offloaded weights before all_gather in execute_colocate_weight_update; track _released_tags to prevent double-release/resume
  • Logging: Suppress werkzeug HTTP request logs; lower inference controller service log level
  • Tests: Add colocated integration tests (single-version parametrized 2/4/8 GPU + multi-version sequential)
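
A hedged sketch of the adapter surface described above; the method names come from the bullet list, but the signatures and the mixin are illustrative, not the actual code under areal/experimental/weight_update/awex/:

```python
from typing import Protocol, Set


class ColocateAdapter(Protocol):
    """Shape of the colocate methods added to AwexTrainingAdapter / AwexInferenceAdapter."""

    def init_colocate_weight_update(self, master_port: int) -> None: ...
    def execute_colocate_weight_update(self) -> None: ...
    def release_memory(self, tag: str) -> None: ...
    def resume_memory(self, tag: str) -> None: ...


class TagTrackingMixin:
    """Illustrates the _released_tags bookkeeping that makes release/resume idempotent."""

    def __init__(self) -> None:
        self._released_tags: Set[str] = set()

    def release_memory(self, tag: str) -> None:
        if tag in self._released_tags:
            return  # already released; avoid double-release
        self._do_release(tag)  # e.g. offload optimizer state or weights to CPU
        self._released_tags.add(tag)

    def resume_memory(self, tag: str) -> None:
        if tag not in self._released_tags:
            return  # never released; nothing to resume
        self._do_resume(tag)  # reload onto the GPU
        self._released_tags.discard(tag)

    def _do_release(self, tag: str) -> None:
        raise NotImplementedError

    def _do_resume(self, tag: str) -> None:
        raise NotImplementedError
```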

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Additional Context

  • Only pure DP (TP=1, PP=1, EP=1) is supported for colocated mode
  • Weight transfer uses CUDA IPC handles serialized through the gateway KV store
  • The gateway orchestrates: release optimizer → resume inference weights (no-op) → execute transfer concurrently → release training weights → resume KV cache (see the sketch after this list)
  • master_port is explicitly required (no default) to avoid port conflicts
  • KV store keys use transfer_rank instead of ip+device_id to avoid CUDA_VISIBLE_DEVICES aliasing issues
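
A minimal sketch of that orchestration order, assuming hypothetical training/inference client objects that expose the worker endpoints listed under Key Changes (the real gateway code may differ):

```python
from concurrent.futures import ThreadPoolExecutor


def colocate_weight_update(training, inference) -> None:
    # 1. Release optimizer state so the IPC transfer has GPU headroom.
    training.release_memory(tag="optimizer")

    # 2. Resume inference weights (a no-op in colocated mode, kept for API symmetry).
    inference.resume_memory(tag="weights")

    # 3. Execute the transfer concurrently: training publishes IPC handles to the
    #    KV store and polls for the done signal; inference fetches and applies them.
    with ThreadPoolExecutor(max_workers=2) as pool:
        train_fut = pool.submit(training.execute_colocate_weight_update)
        infer_fut = pool.submit(inference.execute_colocate_weight_update)
        train_fut.result()
        infer_fut.result()

    # 4. Release training weights, then bring the KV cache back onto the GPU.
    training.release_memory(tag="weights")
    inference.resume_memory(tag="kv_cache")
```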

Implement colocated weight update mode where Megatron training and
SGLang inference share the same GPUs. Uses CUDA IPC (zero-copy on same
device) instead of NCCL P2P across devices for weight transfer.

Key changes:
- Add colocate mode to gateway with find_free_ports for NCCL group
- Implement execute_colocate_weight_update in both adapters
- Add release_memory/resume_memory with CPU offload for optimizer/weights
- Track released tags to prevent double-release and resume of unreleased tags
- Handle offloaded weights in execute_colocate_weight_update (reload before all_gather; see the sketch below)
- Suppress werkzeug HTTP request logs in Guard
- Add colocate integration tests (single + multi-version)

Refs: awex-colocate branch
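
A hedged illustration of the "reload before all_gather" guard mentioned above; apart from torch.distributed.all_gather, the helper and its shape are assumptions:

```python
import torch
import torch.distributed as dist


def gather_full_param(shard: torch.Tensor, world_size: int) -> torch.Tensor:
    # If an earlier release_memory() offloaded this shard to CPU, move it back
    # first: the NCCL collective below requires CUDA tensors.
    if not shard.is_cuda:
        shard = shard.cuda()

    buffers = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(buffers, shard)  # assumes the process group is initialized
    return torch.cat(buffers, dim=0)
```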

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a colocated weight update mode where training and inference share the same GPUs, utilizing CUDA IPC for zero-copy weight transfers. The changes include new API endpoints, adapter implementations for Megatron and SGLang to handle memory offloading (optimizer states, model weights, and KV cache), and gateway orchestration for the transfer process. Feedback identifies critical issues in the memory offloading implementation where tensors were not explicitly moved to CPU, leaving GPU memory occupied. Additionally, a hardcoded security key was flagged, and improvements were suggested for managing HTTP client resources using context managers.
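
For context on the offloading issue flagged here, a sketch of the explicit move-to-CPU pattern the review is asking for (the dict layout is illustrative, not the PR's code):

```python
import torch


def offload_state_to_cpu(state: dict) -> None:
    """Copy CUDA tensors to host memory and drop the GPU copies."""
    for key, value in state.items():
        if torch.is_tensor(value) and value.is_cuda:
            state[key] = value.to("cpu")  # keep the data on the host
    # The old GPU tensors are now unreferenced; release cached blocks so the
    # colocated inference engine can actually use the freed memory.
    torch.cuda.empty_cache()
```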

Outdated review comment threads: areal/experimental/weight_update/awex/megatron_adapter.py (4 threads), areal/experimental/weight_update/awex/sglang_adapter.py (1 thread)