Add Megatron-Bridge Kimi-K2 + DeepSeek-V3 test cases: full-parameter SFT + UCCL-EP-over-EFA dispatcher A/B (A/B validated on 256× B300) #1116
Open
KeitaW wants to merge 13 commits into
Full-parameter SFT of Kimi K2 (1.04T MoE) on EKS with NVIDIA Megatron-Bridge, using UCCL-EP's EFA-native deep_ep drop-in for the expert-parallel all-to-all (replacing NVSHMEM/IB-bound NVIDIA DeepEP without patching Megatron-Core). Structured framework/library/model: the model-agnostic Megatron-Bridge + UCCL environment image lives at the library level (megatron-bridge/) and is shared by per-model recipes; kimi-k2/ holds the model-specific conf, manifests, and benchmarks. The SFT config is mounted at runtime via a ConfigMap, not baked in.
…mo:26.04.01)
- Migrate the shared env image base to nvcr.io/nvidia/nemo:26.04.01 (Megatron-Bridge 0.4.2 / Megatron-Core 0.17.1) to fix the B300 flex/deepep GPU allowlist; build-verify asserts deep_ep resolves to UCCL's venv copy (not NVIDIA dist-packages).
- Add the validated dispatcher A/B harness: run-ab-rawpods.sh (raw ranked Pods + headless Service + static torchrun rendezvous; no PyTorchJob CRD required) driving bench_dsv3_pretrain.py (recipe-native DSV3 256-expert substrate; single MOE_DISPATCHER toggle; overlap/VPP/recompute + forced router load-balancing + loss-probe handling).
- RESULTS: UCCL deepep is ~36% faster than NCCL all-to-all at micro-batch >= 4, and the win holds under deployment-realistic 1F1B overlap (-35.8% on, -36.0% off); NCCL wins only at mb=1 (overhead-bound). Work-equivalence verified two ways (drop-free config + iteration-1 loss match).
- Reconcile all READMEs to the validated raw-pod path; correct the now-false IB-removal and combine-backward-unproven claims; fill the EFA/UCCL log signatures.
- Remove the superseded PyTorchJob benchmark path (0.comm-microbench.sh, 1.run-ab-dispatcher.sh, kimi-k2-bench-pytorchjob.yaml-template).
- Genericize org-specific identifiers (account, cluster name, kubectl context, capacity-block / reservation IDs) to placeholders / required env overrides.
…mi-K2)
The A/B substrate is the deepseek_v3 recipe with DSV3's values (256 routed experts, 128 attention heads), not Kimi-K2 (384 experts, 64 heads, ~1.04T vs ~671B). Relabel the benchmark titles, intros, and model descriptions in RESULTS.md, benchmarks/README.md, and the library README to "DeepSeek-V3 256-expert MoE", with an explicit substrate caveat that it is the architecture family Kimi-K2 belongs to but the literal 384-expert Kimi-K2 number is unrun. The Kimi-K2 SFT-recipe scaffolding (conf/, manifests, checkpoint conversion) keeps its Kimi-K2 naming; only the executed benchmark is relabeled. Also fix a mean-vs-median wording slip in the library README result note (metric of record is mean).
Kimi-K2-Base's HF config uses auto_map -> configuration_deepseek.DeepseekV3Config (custom code), so AutoBridge.from_hf_pretrained fails with "contains custom code which must be executed" unless trust_remote_code=True. Validated on the image (Bridge 0.4.2, nemo:26.04.01): with the flag, AutoBridge routes the DeepseekV3ForCausalLM architecture to DeepSeekV3Bridge and builds the literal Kimi-K2 provider (384 experts, moe_router_num_groups=1, 64 heads, MLA) from the HF config. Replaces the now-resolved TODO(validate against image) marker.
…model case
The dispatcher A/B benchmark ran on the recipe-native DeepSeek-V3 256-expert MoE,
not the literal 384-expert Kimi-K2, so keeping it under kimi-k2/ conflated two
independent experiments. Promote it to a sibling model test case under the shared
megatron-bridge library env:
- git mv kimi-k2/benchmarks/{README,RESULTS,bench_dsv3_pretrain.py,run-ab-rawpods.sh}
-> dsv3/ (flattened; the dir IS the experiment)
- library README: add dsv3 row to the Models table, redraw the layout tree
- repoint every cross-link (kimi-k2 README/conf/kubernetes README/manifest) to the
new sibling location; drop the defunct benchmarks/ Files-table row
- kimi-k2 keeps the SFT recipe (conf/kimi_k2_sft.py, checkpoint conversion, manifests)
FSx paths (/fsx/kimi-k2), PVC (fsx-kimi-k2), and namespace (kimi-k2-bench) left as-is:
they are live cluster artifacts, not repo structure.
…overwrite A/B harness
Adds the infrastructure to run a proper with/without-UCCL dispatcher A/B for BOTH
DeepSeek-V3 and Kimi-K2, with all logs preserved on FSx (no overwrite) for retro and
a per-iteration loss-curve equivalence check.
- kimi-k2/benchmarks/bench_kimi_k2_pretrain.py: builds the LITERAL Kimi-K2 provider
(384 experts, 64 heads, n_group=1, MLA, mtp=0) via AutoBridge, grafts it onto the
DSV3 recipe's mock-data/training scaffolding, re-derives the 61-layer pipeline layout
(mtp-aware -> last stage ["loss"]), and runs pretrain(). Mirrors bench_dsv3_pretrain.py;
same dispatcher A/B toggle + B300-allowlist guard + overlap handling + LOSS_PROBE.
- run-ab-rawpods.sh: promoted from dsv3/ to the library level; now MODEL-aware
(dsv3|kimi-k2) and writes each run to a unique /fsx/megatron-bridge-bench/$CAMPAIGN_ID/
...dir (rank logs, env.txt, STATUS). rank-0 refuses to clobber a completed run.
- bench/run-campaign.sh: drives the 16-run matrix (2 models x {mb1,mb4}x{ovl on,off} x
{alltoall,deepep}) serially, asserts the EFA-active gate per run, frees GPUs between runs.
- bench/parse-runs.py: scrapes rank-0 logs -> per-run loss_curve.csv + campaign index.csv
(mean steady-state iter time, TFLOP/s, tok/s, stalls, efa/uccl validity).
Rendered pod YAML validated; parser unit-tested on a synthetic Megatron log.
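The 16-run matrix described above can be enumerated in a few lines. This is a minimal sketch, not the contents of bench/run-campaign.sh: the axis values come from the commit message, while the function names and the run-name format are hypothetical.

```python
from itertools import product

# Axes of the campaign as listed in the commit message.
MODELS = ("dsv3", "kimi-k2")
MICRO_BATCHES = (1, 4)
OVERLAP = ("on", "off")
DISPATCHERS = ("alltoall", "deepep")


def campaign_matrix():
    """Return one dict per run, in the order a serial driver could launch them."""
    return [
        {"model": m, "micro_batch": mb, "overlap": ovl, "dispatcher": d}
        for m, mb, ovl, d in product(MODELS, MICRO_BATCHES, OVERLAP, DISPATCHERS)
    ]


def run_name(cell):
    """Hypothetical run-directory name; the real script may name runs differently."""
    return (
        f"{cell['model']}-mb{cell['micro_batch']}"
        f"-ovl{cell['overlap']}-{cell['dispatcher']}"
    )


if __name__ == "__main__":
    for cell in campaign_matrix():
        print(run_name(cell))
```

Keeping the dispatcher as the innermost axis places the two arms of each A/B pair next to each other, so a pair is less likely to be split by an interrupted campaign.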
…ecompute coupling
Resolves the two config-validity gaps that made kimi_k2_sft.py fail out-of-the-box:
1. pipeline_model_parallel_layout was left None, which errors on the uneven 61-layer/PP=8 split. Now reuses the DSV3 recipe's MTP-aware layout helper with an explicit mtp_num_layers=0 (Kimi K2 ships no MTP; the helper defaults an ABSENT attr to 1, which would corrupt the layout), plus a loud guard for unsupported (PP,VPP).
2. MOE_A2A_OVERLAP=on (the default) violated core 0.17.1 constraints: overlap requires VPP (PP>1) + recompute fully OFF, but the conf pinned vpp=None + full recompute. VPP/recompute/layout are now finalized together per regime, and delay_wgrad_compute is held OFF (the only validated setting).
Validated on the image (Bridge 0.4.2, nemo:26.04.01) via a config-build gate in a GPU-less pod: both regimes produce experts=384 mtp=0 with correct (8,1)/(8,2) 61-decoder layouts (SFT_CONF_GATE_OK). Mirrors bench_kimi_k2_pretrain.py.
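The overlap/VPP/recompute coupling in item 2 can be pictured as one function that decides all three settings together. The sketch below is an illustration under assumptions: the Regime type, finalize_regime, and the choice of VPP=2 for the overlap regime are inferred from the (8,1)/(8,2) layouts named above and are not taken from kimi_k2_sft.py.

```python
from dataclasses import dataclass
from typing import Optional

# (PP, VPP) pairs the commit message reports as validated layouts.
SUPPORTED_LAYOUTS = {(8, 1), (8, 2)}


@dataclass(frozen=True)
class Regime:
    pp: int
    vpp: Optional[int]        # None means no virtual pipeline stages
    recompute: Optional[str]  # None means recompute fully off
    delay_wgrad_compute: bool


def finalize_regime(overlap: bool, pp: int = 8) -> Regime:
    """Pick VPP and recompute together so the overlap constraints cannot drift apart."""
    vpp_size = 2 if overlap else 1  # assumption: overlap uses the (8,2) layout
    if (pp, vpp_size) not in SUPPORTED_LAYOUTS:
        # Loud guard: refuse layouts that were never validated.
        raise ValueError(f"unsupported (PP, VPP) = ({pp}, {vpp_size})")
    if overlap:
        # Overlap requires VPP with PP > 1 and recompute fully off.
        return Regime(pp, vpp=2, recompute=None, delay_wgrad_compute=False)
    # Without overlap, keep the original pinning: no VPP, full recompute.
    return Regime(pp, vpp=None, recompute="full", delay_wgrad_compute=False)


if __name__ == "__main__":
    print(finalize_regime(overlap=True))
    print(finalize_regime(overlap=False))
```

Deriving the settings from a single switch is what prevents the original failure mode, where the overlap default and the pinned vpp/recompute values were set independently.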
…validated on real logs)
Megatron prints the per-iteration training line (iteration N/M | elapsed time per iteration (ms) | lm loss | TFLOP/s/GPU) on the LAST rank (last PP stage), not rank 0; rank-0.log only carries the Bridge 'Step Time' logger and the EFA/UCCL init lines. parse-runs.py was reading rank-0.log and would have produced zero loss curves. Now parses the highest-numbered rank-<r>.log for iteration lines and rank-0.log for EFA/UCCL signals.
Validated against the preserved 2026-06-01 raw logs: reproduces the published alltoall mb4 overlap=on result exactly (mean 5.9786s vs published 5.978s, 178.62 vs 178.6 TFLOP/s/GPU, 0 stalls), and extracts the per-iteration lm-loss curve.
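The fix amounts to two steps: choose the highest-numbered rank log numerically, then scrape the iteration lines from it. A minimal sketch follows; it is not parse-runs.py itself, and the exact field spelling in the sample lines is an assumption modelled on the fields named in the commit message.

```python
import re

# Assumed shape of a Megatron per-iteration line; real logs carry more fields.
ITER_RE = re.compile(
    r"iteration\s+(\d+)/\s*\d+.*?"
    r"elapsed time per iteration \(ms\):\s*([0-9.]+).*?"
    r"lm loss:\s*([0-9.Ee+-]+)"
)
RANK_RE = re.compile(r"rank-(\d+)\.log$")


def last_rank_log(filenames):
    """Pick the highest rank by number, so rank-10.log beats rank-9.log."""
    ranked = [(int(m.group(1)), f) for f in filenames if (m := RANK_RE.search(f))]
    if not ranked:
        raise FileNotFoundError("no rank-<r>.log files found")
    return max(ranked)[1]


def parse_iterations(text):
    """Return (iteration, seconds_per_iteration, lm_loss) tuples."""
    rows = []
    for line in text.splitlines():
        m = ITER_RE.search(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2)) / 1000.0, float(m.group(3))))
    return rows


def steady_state_mean(rows, warmup):
    """Mean iteration time in seconds, skipping the first `warmup` iterations."""
    times = [t for it, t, _ in rows if it > warmup]
    return sum(times) / len(times)


# Synthetic two-line log used only to exercise the parser.
SAMPLE = (
    " iteration        5/      24 | elapsed time per iteration (ms): 6000.0 | lm loss: 1.189735E+01 |\n"
    " iteration        6/      24 | elapsed time per iteration (ms): 5900.0 | lm loss: 1.180000E+01 |\n"
)

if __name__ == "__main__":
    print(last_rank_log(["rank-0.log", "rank-9.log", "rank-10.log"]))
    print(steady_state_mean(parse_iterations(SAMPLE), warmup=4))
```

Sorting by the integer rank rather than by filename matters here: a lexicographic sort would place rank-9.log after rank-31.log and pick the wrong file.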
run-ab-rawpods.sh moved to the library level and is now MODEL-aware with no-overwrite run dirs. Update every reference and run example in dsv3/README.md (+ RESULTS.md 'How to reproduce'): MODEL=dsv3 bash ../run-ab-rawpods.sh, the campaign driver and post-hoc parser rows, the new /fsx/megatron-bridge-bench/<CAMPAIGN_ID>/ log layout, and the corrected scrape guidance (per-iteration training line is printed by the LAST rank's log, not rank-0; parse with ../bench/parse-runs.py).
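The no-overwrite behaviour can be sketched as a claim step that checks a STATUS marker before any log is written. The directory layout follows the one described above; the STATUS file contents ("RUNNING", "COMPLETED") and the function names are assumptions, and the real logic lives in the shell script.

```python
import tempfile
from pathlib import Path


def claim_run_dir(root, campaign_id, run_name):
    """Create <root>/<campaign_id>/<run_name>, refusing to clobber a completed run."""
    run_dir = Path(root) / campaign_id / run_name
    status = run_dir / "STATUS"
    # Assumed marker value; a finished run is never overwritten.
    if status.exists() and status.read_text().strip() == "COMPLETED":
        raise FileExistsError(f"{run_dir} already holds a completed run")
    run_dir.mkdir(parents=True, exist_ok=True)
    status.write_text("RUNNING\n")
    return run_dir


def mark_completed(run_dir):
    """Record completion so later launches leave this directory alone."""
    (Path(run_dir) / "STATUS").write_text("COMPLETED\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        d = claim_run_dir(root, "demo-campaign", "dsv3-mb4-ovlon-deepep")
        mark_completed(d)
        try:
            claim_run_dir(root, "demo-campaign", "dsv3-mb4-ovlon-deepep")
        except FileExistsError as exc:
            print("refused:", exc)
```

An interrupted run (STATUS still "RUNNING") can be re-claimed under this scheme, which matches the intent of preserving completed results while allowing retries.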
core's comm-overlap setup asserts mtp_num_layers is None or == 1 when
overlap_moe_expert_parallel_comm is enabled — int 0 trips it ('MTP layernum only
supports 1...'), which killed the K2 overlap=on canary at runtime_config_update().
None is correct on every path: the DSV3 layout helper treats None as no-MTP
(['loss'] tail) and the overlap assert accepts it. Fixed in both the benchmark
entrypoint and the SFT conf; gate now exercises comm_overlap.setup() too.
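The failure is easy to reproduce in isolation. The sketch below mimics the constraint described above with a stand-in check; the function names and the error text are illustrative and are not copied from Megatron-Core.

```python
from typing import Optional


def check_overlap_mtp(overlap_moe_expert_parallel_comm: bool,
                      mtp_num_layers: Optional[int]) -> None:
    """Stand-in for the comm-overlap constraint: MTP must be None or 1 under overlap."""
    if overlap_moe_expert_parallel_comm and not (
        mtp_num_layers is None or mtp_num_layers == 1
    ):
        raise AssertionError(
            "mtp_num_layers must be None or 1 when MoE EP comm overlap is enabled "
            "(illustrative message)"
        )


def normalize_mtp_num_layers(value: Optional[int]) -> Optional[int]:
    """Spell 'no MTP' as None instead of 0 so every code path accepts it."""
    return None if value in (0, None) else value


if __name__ == "__main__":
    check_overlap_mtp(True, normalize_mtp_num_layers(0))  # passes
    try:
        check_overlap_mtp(True, 0)  # the original K2 setting
    except AssertionError as exc:
        print("rejected:", exc)
```

The integer 0 and None both mean "no MTP layers" to a reader, but only None satisfies a check written as "None or 1", which is why the conf has to use None.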
…PP8 + PP4 appendix)
First dispatcher A/B measured on literal Kimi-K2 (384 experts, AutoBridge provider), campaign 20260604T083049Z-uccl-ab-pp8-32n-v2 on 256x B300 (TP8/PP8/EP32/DP4):
- mb=4: UCCL deepep -33.8% iter time w/ overlap (3.834 vs 5.793 s), -35.1% w/o overlap (6.041 vs 9.303 s)
- mb=1 no-overlap: NCCL faster by 16.3% (12.88 vs 14.98 s), the same crossover regime as DSV3
- Work equivalence: per-iteration loss curves match to <=4.1e-4 relative over iters 1-10 in all three cell pairs (bf16 round-off, no offset)
- Coverage note: mb1+overlap pair and the same-campaign DSV3 re-run were lost to the Capacity Block expiry (6/8 K2 cells banked); partial logs preserved on FSx
Also: 16-node PP4 appendix table (within-cell deltas only), dsv3/RESULTS.md block-expiry note, kimi-k2 README updated to point at the literal-K2 results.
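The percentages and the loss-match criterion above reduce to two small formulas. The sketch below recomputes the reported deltas from the iteration times in this commit and applies the relative-difference check to the iteration-1 DSV3 loss pair quoted in the PR description (11.897349 vs 11.897517); the function names are illustrative.

```python
def pct_delta(candidate_s, baseline_s):
    """Signed percentage change of candidate iteration time relative to baseline."""
    return 100.0 * (candidate_s - baseline_s) / baseline_s


def max_rel_diff(curve_a, curve_b):
    """Largest per-iteration relative difference between two loss curves."""
    if len(curve_a) != len(curve_b):
        raise ValueError("loss curves must cover the same iterations")
    return max(abs(a - b) / abs(b) for a, b in zip(curve_a, curve_b))


if __name__ == "__main__":
    # Kimi-K2, mb=4: deepep vs alltoall, with and without overlap.
    print(round(pct_delta(3.834, 5.793), 1))
    print(round(pct_delta(6.041, 9.303), 1))
    # Kimi-K2, mb=1 no-overlap: deepep is slower, so the sign flips.
    print(round(pct_delta(14.98, 12.88), 1))
    # Single-point loss check on the DSV3 iteration-1 pair.
    print(max_rel_diff([11.897349], [11.897517]))
```

A full equivalence check would pass the complete per-iteration curves of both arms to max_rel_diff and compare the result against the chosen tolerance.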
…ubdir)
Move dsv3 bench entrypoint + RESULTS.md into dsv3/benchmarks/ so both models follow the same <model>/benchmarks/ convention. Update all cross-references (READMEs, conf, manifest, campaign launcher) and the top-level tree; every relative link re-verified to resolve.
…out)
dsv3 was benchmark-only; give it a full-parameter SFT half so both models expose the same two workloads in an identical layout (README, conf/, kubernetes/, benchmarks/).
- New dsv3/conf/dsv3_sft.py: DeepSeek-V3-256 full-param SFT ConfigContainer, sibling of kimi_k2_sft.py. Keeps DeepSeek-V3's 1 MTP layer (K2 has none), 256 experts (8/EP rank), built on the same deepseek_v3 recipe + flex/deepep dispatch. Scaffolded + unrun; TODOs flagged inline (matches kimi-k2's SFT).
- New dsv3/kubernetes/ PyTorchJob manifest + deploy README (dsv3-sft).
- Promote kimi-k2/1.convert-checkpoint.sh to a shared, model-agnostic convert-checkpoint.sh at the library level, parameterized by HF_MODEL_ID/HF_REVISION/FSX_ROOT/TRUST_REMOTE_CODE. DeepSeek-V3 conversion uses the dedicated deepseek_v3_bridge shipped in Bridge v0.4.2.
- dsv3/README.md: add Full-parameter SFT section; update Files table + the stale conf/kimi_k2_sft.py references (dsv3 now has its own conf).
- Top README + kimi-k2 README: shared convert script, both models = SFT + A/B.
Validated: bash -n, py_compile, manifest YAML (3 docs), all relative links resolve, copyright headers present. SFT recipes are scaffolded/unrun; the measured dispatcher A/Bs remain the validated part.
What this PR does
Adds a reproducible test case for full-parameter SFT of Kimi K2 (Moonshot AI's 1.04T-param DeepSeek-V3-family MoE) on Amazon EKS with NVIDIA Megatron-Bridge, using UCCL-EP's EFA-native deep_ep drop-in for the expert-parallel all-to-all. This replaces NVSHMEM/IB-verbs-bound NVIDIA DeepEP (which cannot run on EFA) without patching Megatron-Core: UCCL ships a top-level deep_ep shadow module so Megatron-Core's flex/deepep dispatcher transparently routes over EFA via UCCL + GDRCopy.
It also lands the first end-to-end DeepEP-on-EFA-vs-NCCL-all-to-all training A/B on B300 (see results below; no such measurement existed in the literature), measured on two substrates: the DeepSeek-V3 256-expert recipe (2026-06-01) and the literal 384-expert Kimi-K2 built via AutoBridge (2026-06-04), with per-iteration loss-curve work-equivalence on both.
Structure
Laid out framework/library/model so future Megatron-Bridge models can reuse the environment:
- 3.test_cases/megatron/megatron-bridge/: model-agnostic shared env (Dockerfile, build + single-node sanity scripts, shared convert-checkpoint.sh, CI build test).
- 3.test_cases/megatron/megatron-bridge/kimi-k2/: Kimi K2 (384-expert). Full-parameter SFT recipe (conf + K8s manifests) and the dispatcher A/B (benchmarks/).
- 3.test_cases/megatron/megatron-bridge/dsv3/: DeepSeek-V3 (256-expert). The same two workloads in a structurally identical layout: a full-parameter SFT recipe and the measured dispatcher A/B below.
The image bakes no model config; each model mounts its conf/ at /workspace/conf at runtime via a ConfigMap, so one image serves every model under the library.
Components
- nemo:26.04.01 (CUDA 13.1, PyTorch 2.11), which already ships Megatron-Bridge v0.4.2 / Megatron-Core 0.17.1 with the flex/deepep dispatcher. (The older 25.11.01 base shipped Bridge 0.2.0, whose flex/deepep GPU allowlist rejects p6-b300; 0.4.0+ fixes it.)
- … sm_103), 16× EFA/node
- … deep_ep shadow, pinned commit 0dc87eb; IB verbs/NVSHMEM removed)
Benchmark result: UCCL-EP vs NCCL all-to-all (256× B300, measured 2026-06-01)
A live 32× p6-b300.48xlarge (256× B300) A/B that swaps only the Megatron-Core MoE token dispatcher; everything else (model, data, parallelism TP8/PP8/EP32/DP4, seq 4096, GBS 256, bf16, balanced routing, image) is held byte-identical. Substrate is the deepseek_v3 recipe: DeepSeek-V3 256-expert MoE, the architecture family Kimi K2 belongs to but not the literal 384-expert Kimi K2 (see RESULTS.md for the substrate caveat).
The dispatcher winner crosses over with micro-batch granularity.
Figures are mean training-iteration time (lower = better) over 16 steady-state iters after warmup, with 0 stalls and EFA active on every rank. At the throughput-efficient operating point (micro-batch ≥ 4) UCCL deep_ep is ~36% faster, and the advantage holds under deployment-realistic 1F1B overlap (−35.8% with overlap on, −36.0% with it off), contrary to the usual expectation that overlap compresses the dispatcher delta toward parity. NCCL wins only at micro-batch 1 (64 tiny dispatches, where UCCL-EP's per-dispatch overhead is unamortized), an operating point no throughput-tuned run uses.
Work-equivalence verified (no token dropping): both arms' config dumps are byte-identical except the dispatcher selector (no moe_expert_capacity_factor/moe_token_drop_policy set → drop-free by construction), and the iteration-1 losses match (deepep 11.897349 vs alltoall 11.897517, rel diff 1.4e-5: bf16 round-off, identical num_tokens). So the −36% is a genuine equal-work speedup.
Benchmark result: literal Kimi-K2, 384 experts (256× B300, measured 2026-06-04)
The same A/B re-run on the literal Kimi-K2 (61 layers, 384 routed experts, top-8, MLA, no MTP), with the provider built by AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Base", trust_remote_code=True).to_megatron_provider(): same image, same TP8/PP8/EP32/DP4 layout, mock data, 24 iters/run with log_interval=1 (first 4 dropped). See kimi-k2/benchmarks/RESULTS.md.
The DSV3 findings transfer to the real K2 architecture: at mb=4 UCCL deepep is −33.8% with overlap (3.834 vs 5.793 s) and −35.1% without (6.041 vs 9.303 s); NCCL is faster only at mb=1 no-overlap (12.88 vs 14.98 s, 16.3%). Work-equivalence held across the full run: per-iteration lm loss curves for deepep vs alltoall match within ≤4.1e-4 relative over iters 1–10 in all three cell pairs (bf16 round-off accumulation, no systematic offset). The mb=1+overlap pair was not measured: the Capacity Block expired mid-campaign after 6 of 8 cells (partial logs preserved). A 16-node TP8/PP4/EP32/DP2 appendix table is in the results doc (within-cell deltas only, not comparable across layouts).
Testing performed
- bash -n + copyright + set -e, Python py_compile + copyright, Dockerfile version pins (EFA 1.48.0, GDRCopy v2.5.2, CUDA 13.1) and pinned FROM, no :latest/binaries/experimental.
- test_megatron_bridge_uccl.py builds the image and asserts import deep_ep resolves to the UCCL wrapper (deep_ep.Buffer present).
- … 2.sanity-singlenode.sh): Gates 1–4 pass (EFA device present, UCCL deep_ep active with Buffer, MCore flex/deepep config fields, NCCL all-reduce over EFA). Gate 5's hand-rolled MoEFlexTokenDispatcher micro-step is stale on Core 0.17.1 (predates the ProcessGroupCollection API); this is not a UCCL/image fault. The real pretrain() path builds the process groups internally and the multi-node benchmark runs clean through MoEFlexTokenDispatcher(backend="deepep"), which is the authoritative end-to-end dispatch check.
- … NET/OFI Selected provider is efa, fabric is efa-direct (found 16 nics) (true EFA RDMA, no socket fallback) and the UCCL arm logged the UCCL-EP EFA proxy active.
Checklist
FROM