feat(archon): add colocated CUDA IPC weight transfer for awex #1310
Open
garrett4wade wants to merge 2 commits into
Conversation
Implement colocated weight update mode where Megatron training and SGLang inference share the same GPUs. Uses CUDA IPC (zero-copy on same device) instead of NCCL P2P across devices for weight transfer.

Key changes:
- Add colocate mode to gateway with find_free_ports for NCCL group
- Implement execute_colocate_weight_update in both adapters
- Add release_memory/resume_memory with CPU offload for optimizer/weights
- Track released tags to prevent double-release or resume of unreleased tags
- Handle offloaded weights in execute_colocate_weight_update (reload before all_gather)
- Suppress werkzeug HTTP request logs in Guard
- Add colocate integration tests (single + multi-version)

Refs: awex-colocate branch
Contributor
Code Review
This pull request introduces a colocated weight update mode where training and inference share the same GPUs, utilizing CUDA IPC for zero-copy weight transfers. The changes include new API endpoints, adapter implementations for Megatron and SGLang to handle memory offloading (optimizer states, model weights, and KV cache), and gateway orchestration for the transfer process. Feedback identifies critical issues in the memory offloading implementation where tensors were not explicitly moved to CPU, leaving GPU memory occupied. Additionally, a hardcoded security key was flagged, and improvements were suggested for managing HTTP client resources using context managers.
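For context, the offloading pitfall the review describes usually comes down to rebinding a parameter's `.data` to a CPU copy, which is what actually lets the CUDA allocation be freed. The sketch below is an illustrative pattern, not the PR's code; the helper names are hypothetical.

```python
import torch

def offload_to_cpu(param: torch.nn.Parameter) -> None:
    # Hypothetical helper: an "offload" that merely copies data without this
    # explicit rebind leaves the original CUDA storage alive and the GPU
    # memory occupied -- the issue flagged in the review.
    param.data = param.data.to("cpu").pin_memory()

def resume_to_gpu(param: torch.nn.Parameter, device: str = "cuda") -> None:
    # Pinned host memory allows an asynchronous H2D copy on resume.
    param.data = param.data.to(device, non_blocking=True)

# After offloading everything under a tag, cached blocks can be returned
# to the driver with torch.cuda.empty_cache().
```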
Force-pushed from 071b8ab to f8c4841
Description
Add colocated weight update mode where Megatron training and SGLang inference share the same GPUs. Uses CUDA IPC (zero-copy on same device) instead of NCCL P2P across devices for weight transfer.
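As a rough illustration of the zero-copy mechanism (not the PR's actual serialization path: the KV-store handoff below is a stand-in), PyTorch can export a CUDA IPC handle for a tensor so another process on the same device can map the same memory:

```python
import pickle
import torch
from torch.multiprocessing.reductions import reduce_tensor

# --- training process ---
weight = torch.randn(4096, 4096, device="cuda")
rebuild_fn, rebuild_args = reduce_tensor(weight)   # exports a CUDA IPC handle
blob = pickle.dumps((rebuild_fn, rebuild_args))    # e.g. the value put into the KV store

# --- inference process (same physical GPU), after fetching `blob` ---
rebuild_fn, rebuild_args = pickle.loads(blob)
shared = rebuild_fn(*rebuild_args)  # maps the sender's memory: no device-to-device copy
# Copy into the engine's own parameter before signaling "done", so the
# trainer can safely reuse or free its buffer afterwards.
```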
Related Issue
N/A (new feature on fw/awex-colocate branch)
Type of Change
Key Changes
- Gateway: add `colocate=True` connect mode with `find_free_ports` for NCCL group initialization
- Megatron training adapter: `execute_colocate_weight_update` (serialize weights via IPC, put to KV store, poll for inference done signal), plus `release_memory`/`resume_memory` with CPU offload for optimizer states and model weights
- SGLang inference adapter: `execute_colocate_weight_update` (fetch IPC weights from KV store, apply via `NcclColocateStreamBatchTransport`), plus `release_memory`/`resume_memory` with tag tracking (see the sketch after this list)
- Extend the `AwexTrainingAdapter` and `AwexInferenceAdapter` protocols
- Expose `init_colocate_weight_update`, `execute_colocate_weight_update`, `release_memory`, `resume_memory` on both training (Flask) and inference (FastAPI) services
- Reload offloaded weights before `all_gather` in `execute_colocate_weight_update`; track `_released_tags` to prevent double-release/resume
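The tag bookkeeping mentioned above could be as simple as the following sketch (the class name is hypothetical; `_released_tags` is the field the PR describes):

```python
class ReleasableMemoryMixin:
    """Sketch of tag tracking: reject a double release and reject
    resuming a tag that was never released."""

    def __init__(self) -> None:
        self._released_tags: set[str] = set()

    def release_memory(self, tag: str) -> None:
        if tag in self._released_tags:
            raise RuntimeError(f"tag {tag!r} is already released")
        # ... offload the tensors registered under `tag` to CPU ...
        self._released_tags.add(tag)

    def resume_memory(self, tag: str) -> None:
        if tag not in self._released_tags:
            raise RuntimeError(f"tag {tag!r} was never released")
        # ... reload the tensors for `tag` back onto the GPU ...
        self._released_tags.discard(tag)
```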
Checklist
- Pre-commit checks pass (`pre-commit run --all-files`)
- Docs build (`./docs/build_all.sh`)
- Branch is based on `main`
- `/review-pr` command / `/create-pr`
Additional Context
- `master_port` is explicitly required (no default) to avoid port conflicts
- Use `transfer_rank` instead of `ip+device_id` to avoid CUDA_VISIBLE_DEVICES aliasing issues
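On the port handling: a free-port helper in the spirit of `find_free_ports` typically binds to port 0 and lets the OS pick, as sketched below (an assumed implementation, not necessarily the PR's). Note the inherent race between closing the socket and reusing the port, which is one reason to require an explicit `master_port`.

```python
import socket

def find_free_ports(n: int) -> list[int]:
    # Assumed implementation sketch. Keep every socket open while
    # collecting so the OS cannot hand out the same port twice.
    socks: list[socket.socket] = []
    try:
        for _ in range(n):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind(("", 0))  # port 0: the OS assigns an unused port
            socks.append(s)
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()
```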