61 commits
dd3c234
WIP: Support co-locate training and inference (#81)
zhubohao911 May 7, 2026
1161dba
init doc
zhubohao911 May 11, 2026
2a19cc6
Merge branch 'lightseekorg:main' into feature/colocate-training-infer…
zhubohao911 May 12, 2026
6e63070
feat(colocate): phases 0-2 of MPS-strategy colocate training
zhubohao911 May 13, 2026
fd95a00
Phase 3: NCCL P2P data plane (dummy tensors)
zhubohao911 May 13, 2026
16df545
Phase 4: NCCL hidden-state connector + multi-tensor data plane
zhubohao911 May 13, 2026
b239f5c
Phases 5 + 6: controller trim, init-order fence, MPS hygiene
zhubohao911 May 13, 2026
cb0cc70
Phase 7: numeric parity & convergence test skeletons
zhubohao911 May 13, 2026
b22aff2
Phase 8: colocate usage docs + Qwen3-8B example
zhubohao911 May 13, 2026
ff51ffe
Phase 4: ship the colocate sglang patch + wire Modal image to apply it
zhubohao911 May 13, 2026
21f1350
Modal: layer colocate.patch on top of overlay + assert patch surface …
zhubohao911 May 13, 2026
96fa0ad
colocate: implement Phase-5 sync training loop + driver-side union spec
zhubohao911 May 13, 2026
a11b63d
colocate: bring up MPS pre-Ray and propagate pipe env to controller
zhubohao911 May 13, 2026
5a59f40
colocate: dump MPS daemon log on CUDA error 805
zhubohao911 May 13, 2026
1524815
tests/colocate/one_step: dump nvidia-mps daemon log on failure
zhubohao911 May 13, 2026
bc7df55
colocate: detect 'MPS not supported' and fall back to fractional GPU …
zhubohao911 May 13, 2026
530bf7d
colocate: probe MPS via real CUDA client subprocess (cuInit/cuDeviceG…
zhubohao911 May 13, 2026
19e9603
colocate: switch union world to lazy NCCL init to tolerate slow engin…
zhubohao911 May 13, 2026
7c7e612
tests/colocate/one_step: stream subprocess output to log file (so we …
zhubohao911 May 13, 2026
5f7d302
tests/colocate/one_step: bump timeout to 30min for cold HF cache
zhubohao911 May 13, 2026
900f2fe
docs/colocate: log the Modal MPS / eager-NCCL discoveries from Phase-…
zhubohao911 May 13, 2026
851f5dc
tests/colocate: skip Phase-4+ tests when MPS server can't start
zhubohao911 May 13, 2026
55b22d5
tests/colocate/test_placement: handle None handle from MPS-fallback path
zhubohao911 May 13, 2026
d947716
docs/colocate: final Phase 1-7 verification matrix on Modal sandbox
zhubohao911 May 13, 2026
9633f64
colocate: add cheap-host MPS smoke (1×GPU, Qwen3-0.6B) + fix mps-help…
zhubohao911 May 13, 2026
e8f7a26
docs/colocate: cheap-host test plan + --full runner mode for agent ha…
zhubohao911 May 13, 2026
925de62
scripts/colocate: harden runner with real MPS pre-flight + auto-report
zhubohao911 May 13, 2026
5b891a8
mooncake/store: lazy-import so colocate doesn't need libibverbs/libnuma
zhubohao911 May 13, 2026
34edb68
docs/colocate: add bilingual GPU/CUDA knowledge supplement
zhubohao911 May 13, 2026
9bbb263
utils/logging: configure 'torchspec' namespace so submodule INFO surf…
zhubohao911 May 13, 2026
c7ffdb8
docs/colocate: RunPod validation session findings + SM89+ requirement
zhubohao911 May 14, 2026
182da4a
colocate: instrument TP scheduler init path to surface NCCL rendezvou…
zhubohao911 May 14, 2026
dbc0796
colocate.patch: fix @@ hunk line counts after TS-COLOCATE-TRACE injec…
zhubohao911 May 14, 2026
c74607c
colocate.patch: switch TS-COLOCATE-TRACE prints to logger.warning
zhubohao911 May 14, 2026
ad9b413
colocate: defang dist.new_group in TP scheduler subprocess to break d…
zhubohao911 May 14, 2026
755cc1e
colocate: align trainer + engine world-group new_group sequence
zhubohao911 May 14, 2026
15e5797
colocate.patch: fix ModelRunner hunk +line count (88 -> 92)
zhubohao911 May 14, 2026
9dce844
colocate/world: align use_local_synchronization with engine side
zhubohao911 May 14, 2026
be36985
colocate: dp_attention.py post-patch surgery for engine rank offset
zhubohao911 May 14, 2026
ebadf36
trainer: build colocate-aware trainer-only DP mesh
zhubohao911 May 14, 2026
67fca8c
docs/colocate: iter 1-10 RunPod debug session findings
zhubohao911 May 14, 2026
76f3d6b
colocate: trainer-only gloo group + 1-rank DP fallback to gloo
zhubohao911 May 14, 2026
e6e0f49
fsdp: scope broadcasts to mesh_group, not default PG
zhubohao911 May 14, 2026
9c95194
fsdp: disable broadcast_from_rank0 for single-rank trainer mesh
zhubohao911 May 14, 2026
252591f
training: scope all trainer-side dist collectives to trainer-only group
zhubohao911 May 14, 2026
953531f
target_utils: handle tied-embedding models in TargetLMHead loader
zhubohao911 May 14, 2026
f3ad648
colocate: rebuild sglang _WORLD as engine-only [N,2N)
zhubohao911 May 14, 2026
2e6b16b
colocate: fix tp_worker broadcast_pyobj global-rank arg (post-patch s…
zhubohao911 May 14, 2026
38bb1da
add Eagle3 colocate aux_hidden_states_layers auto-resolver
zhubohao911 May 14, 2026
aad72e2
colocate: route hidden-state P2P over gloo, not the NCCL union world
zhubohao911 May 14, 2026
cd69fc2
colocate: read train/avg_loss, the key the trainer actually emits
zhubohao911 May 14, 2026
2aaa010
docs/colocate: iters 11-20 session log — test_colocate_tiny.py green
zhubohao911 May 14, 2026
927beaa
colocate: log peak_alloc in the per-step line for the stability test
zhubohao911 May 14, 2026
33b7e26
colocate: fix engine union-world rank computation for N>1
May 14, 2026
a5a0288
colocate: create all shared new_groups before role-restricted ones
May 14, 2026
058871d
colocate: dp_attention rank offset must be the engine's own union rank
May 15, 2026
bdc30ae
colocate: scope set_model_state_dict broadcast to the trainer mesh
May 15, 2026
bd7a5e5
docs/colocate: Vast session #4 — 4xH100 --full suite green (runs #1-#7)
May 15, 2026
a85cec7
docs/colocate: expand session #4 — debug methodology + next steps
May 15, 2026
59400f1
colocate: scope dcp.save / dcp.load to the trainer-only group
May 15, 2026
6b1115b
docs/colocate: session #5 — verification re-run + single_rank audit +…
May 15, 2026
85 changes: 85 additions & 0 deletions configs/colocate_qwen0p6b_tiny.yaml
@@ -0,0 +1,85 @@
# Tiny-model colocate config for cheap-host MPS validation.
#
# Same colocate code path as `configs/colocate_qwen3_8b.yaml` (MPS strategy +
# NCCL transfer + Phase-0 invariants), but sized so the entire trainer +
# engine + KV-cache footprint fits inside a single 24 GB consumer/L40S-class
# GPU. The intent is to give people without 4×H100 access a way to actually
# *run* the MPS-required Phase-4/6/7 tests on a $0.30-2.00/hr cheap GPU
# rental (Vast.ai, Lambda spot, Hyperstack, etc.) for a one-shot
# correctness check.
#
# Footprint at a glance (Qwen3-0.6B Base, 600 M params, fp16):
# - trainer (FSDP world=1, no sharding): weights 1.2 GB + grads 1.2 GB
# + AdamW fp32 state 4.8 GB ≈ 7.2 GB → fits in 0.45×24 GB = 10.8 GB.
# - engine (sglang, tp=1): weights 1.2 GB + KV cache for 16 K ctx
# (≈ 4 GB), total ≈ 5.2 GB → fits in 0.45×24 GB = 10.8 GB.
# - 0.10 headroom = 2.4 GB on a 24 GB card; CUDA context + allocator
# caches comfortably fit.
#
# Phase-0 invariant: engine_count × engine_tp_size == world_size = 1×1 = 1.
#
# Run via the local Docker / Vast.ai runner, not the Modal smoke script:
# bash scripts/colocate/run_smoke_host.sh

model:
  target_model_path: Qwen/Qwen3-0.6B-Base
  trust_remote_code: true

dataset:
  train_data_path: ../examples/data/sample_conversations.jsonl
  chat_template: qwen
  prompt_key: conversations

training:
  attention_backend: flex_attention
  micro_batch_size: 1
  draft_accumulation_steps: 1
  learning_rate: 1e-4
  max_concurrent_batches: 1
  max_grad_norm: 0.5
  # Smaller than the Qwen3-8B config so KV cache fits in 0.45×24 GB.
  max_seq_length: 2048
  num_epochs: 1
  seed: 42
  # 1:1 trainer↔engine on a single GPU. world_size = 1.
  training_num_gpus_per_node: 1
  training_num_nodes: 1
  ttt_length: 7
  save_per_epoch: false
  warmup_ratio: 0.015

  # ─── Colocate flags (same as Qwen3-8B config) ────────────────────
  colocate_strategy: mps
  transfer_mode: nccl
  train_frac: 0.45
  infer_frac: 0.45

inference:
  inference_engine_type: sgl
  # 1 engine, 1 GPU, tp=1 — the only topology that satisfies the Phase-0
  # invariant `engine_count × engine_tp_size == world_size = 1`.
  inference_num_gpus: 1
  inference_num_gpus_per_engine: 1
  inference_num_gpus_per_node: 1
  max_sample_pool_size: 8
  inference_buffer_threshold: 4
  inference_batch_size: 2
  sglang:
    tp_size: 1
    mem_fraction_static: 0.45

mooncake:
  master_server_address: null
  metadata_server: null
  protocol: tcp
  global_segment_size: 4GB
  local_buffer_size: 1GB

output_dir: ./outputs/colocate-qwen0p6b-tiny
cache_dir: ./cache/colocate-qwen0p6b-tiny
model_download_dir: null

debug:
  save_debug_train_data: null
  debug_train_only: false
  debug_inference_only: false
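(Not part of the diff.) The two invariants the config headers keep citing — the topology invariant `engine_count × engine_tp_size == world_size` and the memory-split invariant `train_frac + infer_frac + 0.10 headroom <= 1.0` — are mechanical enough to sanity-check before launch. A minimal sketch under those assumptions; `check_colocate_config` is a hypothetical helper, not something this PR ships, and the field names simply mirror the YAML above:

```python
# Hypothetical pre-flight check (not in this PR): re-derive the invariants
# documented in the config headers. Field names mirror the YAML files above.
import yaml

def check_colocate_config(path: str) -> None:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    t, inf = cfg["training"], cfg["inference"]

    # Topology invariant: engine_count × engine_tp_size must equal the
    # training world size (1×1 = 1 for the tiny config, 4×1 = 4 for 8B).
    world = t["training_num_gpus_per_node"] * t["training_num_nodes"]
    engine_count = inf["inference_num_gpus"] // inf["inference_num_gpus_per_engine"]
    tp = inf["sglang"]["tp_size"]
    assert engine_count * tp == world, (engine_count, tp, world)

    # Memory-split invariant: train + infer + 0.10 headroom <= 1.0.
    assert t["train_frac"] + t["infer_frac"] + 0.10 <= 1.0 + 1e-9

    # Footprint arithmetic from the tiny-config header: fp16 weights and
    # grads are 2 B/param each, AdamW fp32 (m, v) state is 8 B/param, so
    # Qwen3-0.6B costs roughly 0.6e9 × (2 + 2 + 8) B ≈ 7.2 GB trainer-side.

check_colocate_config("configs/colocate_qwen0p6b_tiny.yaml")
```

For the tiny config this reduces to 1×1 == 1 and 0.45 + 0.45 + 0.10 == 1.0; the 8B config gives 4×1 == 4 with the same memory split.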
89 changes: 89 additions & 0 deletions configs/colocate_qwen3_8b.yaml
@@ -0,0 +1,89 @@
# Configuration for colocate (MPS+NCCL) training on a single 4×H100 node.
#
# This is the colocate sibling of `configs/sglang_qwen3_8b.yaml`. The two
# configs differ in three places:
#
# 1. `training.colocate_strategy: mps` + `training.transfer_mode: nccl`
# enable the colocate path (Phase 0 invariants).
# 2. `training.train_frac` + `training.infer_frac` set the per-GPU
# memory split (Phase 1 invariant: train + infer + 0.10 headroom <= 1.0).
# 3. `inference.inference_num_gpus` == `training.training_num_gpus_per_node`
# and `inference.inference_num_gpus_per_engine == 1`. This pins the
# 1:1 trainer↔engine-rank pairing the union NCCL world expects
# (Phase 2 invariant: engine_count × engine_tp_size == training_world_size).
#
# Everything else mirrors the disaggregated config so a side-by-side
# comparison is meaningful (Phase 7 grad parity + convergence runs).
#
# Run:
# ./examples/colocate-qwen3-8b-1node/run.sh

model:
  target_model_path: Qwen/Qwen3-8B
  trust_remote_code: true

dataset:
  train_data_path: ../examples/data/sample_conversations.jsonl
  chat_template: qwen
  prompt_key: conversations

training:
  attention_backend: flex_attention
  micro_batch_size: 1
  draft_accumulation_steps: 1
  learning_rate: 1e-4
  max_concurrent_batches: 1
  max_grad_norm: 0.5
  max_seq_length: 16384
  num_epochs: 1
  seed: 42
  training_num_gpus_per_node: 4
  training_num_nodes: 1
  ttt_length: 7
  save_per_epoch: true
  warmup_ratio: 0.015

  # ─── Colocate flags (Phase 0–4) ─────────────────────────────────
  # mps: trainer + engine ranks share one physical GPU via NVIDIA MPS.
  # nccl: hidden states cross the engine→trainer boundary via P2P
  #   `dist.batch_isend_irecv` on the Phase-2 union world (no Mooncake).
  colocate_strategy: mps
  transfer_mode: nccl
  train_frac: 0.45
  infer_frac: 0.45

inference:
  inference_engine_type: sgl
  # 1:1 trainer↔engine-rank pairing — see Phase 1 config invariant C.
  inference_num_gpus: 4
  inference_num_gpus_per_engine: 1
  inference_num_gpus_per_node: 4
  max_sample_pool_size: 64 # unused under colocate, kept for symmetry
  inference_buffer_threshold: 32
  inference_batch_size: 8
  sglang:
    tp_size: 1
    # Unused under colocate — `infer_frac` is the canonical budget; SglEngine
    # overrides `mem_fraction_static` to match. Setting it here just documents
    # the equivalence.
    mem_fraction_static: 0.45

# Mooncake config is not required when transfer_mode=nccl, but the
# parser still expects the section. Leave it as a null sentinel: the
# colocate train_entry branch never invokes build_mooncake_config, so
# these values are never used.
mooncake:
  master_server_address: null
  metadata_server: null
  protocol: tcp
  global_segment_size: 16GB
  local_buffer_size: 4GB

output_dir: ./outputs/colocate-qwen3-8b-1node
cache_dir: ./cache/colocate-qwen3-8b-1node
model_download_dir: null

debug:
  save_debug_train_data: null
  debug_train_only: false
  debug_inference_only: false
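(Not part of the diff.) For readers who haven't met the primitive the colocate-flags comment names: below is a minimal sketch of a trainer rank receiving one hidden-state tensor from its paired engine rank via `dist.batch_isend_irecv`. The engine ranks occupying `[N, 2N)` follows commit f3ad648; the shape, dtype, and group handle are illustrative assumptions; the PR's real connector moves multiple tensors per step (commit 16df545), and per commit aad72e2 it ultimately routes hidden-state P2P over a gloo group rather than the NCCL union world.

```python
# Sketch only: the call pattern behind `transfer_mode: nccl`, not the PR's
# actual connector. Shape, dtype, and group are assumptions.
import torch
import torch.distributed as dist

def recv_hidden_states(trainer_rank: int, world_n: int,
                       shape: tuple[int, ...], device: torch.device,
                       group=None) -> torch.Tensor:
    # 1:1 pairing: trainer rank r talks to engine rank N + r, since engine
    # ranks occupy [N, 2N) in the union world.
    peer = world_n + trainer_rank
    buf = torch.empty(shape, dtype=torch.bfloat16, device=device)
    # batch_isend_irecv takes a list of P2POp descriptors and returns a list
    # of async work handles; the engine side posts the matching isend ops.
    ops = [dist.P2POp(dist.irecv, buf, peer, group=group)]
    for work in dist.batch_isend_irecv(ops):
        work.wait()
    return buf
```

Under a gloo group the buffer would live on CPU rather than `device`; the batched form matters once the connector posts several sends/receives per step, since NCCL requires the ops to be grouped to avoid deadlock.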