Add Megatron-Bridge Kimi-K2 + DeepSeek-V3 test cases: full-parameter SFT + UCCL-EP-over-EFA dispatcher A/B (A/B validated on 256× B300) by KeitaW · Pull Request #1116 · awslabs/awsome-distributed-ai

KeitaW · 2026-05-31T12:52:58Z

What this PR does

Adds a reproducible test case for full-parameter SFT of Kimi K2 (Moonshot AI's 1.04T-param DeepSeek-V3-family MoE) on Amazon EKS with NVIDIA Megatron-Bridge, using UCCL-EP's EFA-native deep_ep drop-in for the expert-parallel all-to-all. This replaces NVSHMEM/IB-verbs-bound NVIDIA DeepEP (which cannot run on EFA) without patching Megatron-Core — UCCL ships a top-level deep_ep shadow module so Megatron-Core's flex/deepep dispatcher transparently routes over EFA via UCCL + GDRCopy.

It also lands what we believe is the first end-to-end DeepEP-on-EFA-vs-NCCL-all-to-all training A/B on B300 (see results below); we found no prior published measurement. It was measured on two substrates: the DeepSeek-V3 256-expert recipe (2026-06-01) and the literal 384-expert Kimi-K2 built via AutoBridge (2026-06-04), with a loss-based work-equivalence check on both (iteration-1 loss match on DeepSeek-V3, per-iteration loss curves on Kimi-K2).

Structure

Laid out framework/library/model so future Megatron-Bridge models can reuse the environment:

3.test_cases/megatron/megatron-bridge/ — model-agnostic shared env (Dockerfile, build + single-node sanity scripts, shared convert-checkpoint.sh, CI build test).
3.test_cases/megatron/megatron-bridge/kimi-k2/ — Kimi K2 (384-expert): full-parameter SFT recipe (conf + K8s manifests) and the dispatcher A/B (benchmarks/).
3.test_cases/megatron/megatron-bridge/dsv3/ — DeepSeek-V3 (256-expert): the same two workloads in a structurally identical layout — a full-parameter SFT recipe and the measured dispatcher A/B below.

The image bakes no model config; each model mounts its conf/ at /workspace/conf at runtime via a ConfigMap, so one image serves every model under the library.

Components

Framework: NVIDIA Megatron-Bridge — NGC nemo:26.04.01 (CUDA 13.1, PyTorch 2.11), which already ships Megatron-Bridge v0.4.2 / Megatron-Core 0.17.1 with the flex/deepep dispatcher. (The older 25.11.01 base shipped Bridge 0.2.0, whose flex/deepep GPU allowlist rejects p6-b300; 0.4.0+ fixes it.)
Models/Workloads: Kimi-K2 (1.04T MoE, 384 experts, MLA) and DeepSeek-V3 (671B MoE, 256 experts, MLA), each with full-parameter SFT and the UCCL-EP-over-EFA dispatcher A/B, in a structurally identical layout
Hardware target: 32× p6-b300.48xlarge (256× B300, sm_103), 16× EFA/node
Orchestration: Kubernetes / EKS (Kubeflow PyTorchJob, etcd rendezvous, optional KAI gang scheduling)
Transport: UCCL-EP over AWS EFA (deep_ep shadow, pinned commit 0dc87eb; IB verbs/NVSHMEM removed)

Benchmark result — UCCL-EP vs NCCL all-to-all (256× B300, measured 2026-06-01)

A live 32× p6-b300.48xlarge (256× B300) A/B that swaps only the Megatron-Core MoE token dispatcher — everything else (model, data, parallelism TP8/PP8/EP32/DP4, seq 4096, GBS 256, bf16, balanced routing, image) held byte-identical. Substrate is the deepseek_v3 recipe — DeepSeek-V3 256-expert MoE, the architecture family Kimi K2 belongs to but not the literal 384-expert Kimi K2 (see RESULTS.md for the substrate caveat).
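The layout arithmetic behind this configuration can be checked with a short sketch. The function name and return keys are illustrative and not part of the harness. It assumes the common Megatron convention that the GPU count is TP × PP × DP (no context parallelism) and that each data-parallel replica runs GBS / (DP × micro-batch) micro-batches per iteration.

```python
def layout_summary(tp, pp, dp, ep, num_experts, gbs, micro_batch):
    """Derive GPU count, experts per EP rank, and micro-batches per iteration."""
    experts_per_rank, leftover = divmod(num_experts, ep)
    if leftover:
        raise ValueError("num_experts must divide evenly across EP ranks")
    microbatches, leftover = divmod(gbs, dp * micro_batch)
    if leftover:
        raise ValueError("GBS must be a multiple of DP * micro_batch")
    return {
        "gpus": tp * pp * dp,                    # assumed: world size = TP x PP x DP
        "experts_per_ep_rank": experts_per_rank,
        "microbatches_per_iter": microbatches,
    }


# DeepSeek-V3 substrate (256 experts) at the two micro-batch sizes in the table.
for mb in (1, 4):
    print(mb, layout_summary(tp=8, pp=8, dp=4, ep=32,
                             num_experts=256, gbs=256, micro_batch=mb))
```

Under these assumptions the layout covers 256 GPUs with 8 experts per EP rank, and micro-batch 1 yields 64 micro-batches per iteration versus 16 at micro-batch 4, which is consistent with the "64 tiny dispatches" remark in the results discussion.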

The dispatcher winner crosses over with micro-batch granularity:

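| micro-batch | overlap | NCCL all-to-all | UCCL `deep_ep` | dispatcher delta (UCCL vs NCCL iter time) |
|---|---|---|---|---|
| 1 | off | 12.54 s | 14.12 s | +12.6% (NCCL faster) |
| 4 | off | 9.77 s | 6.26 s | −36.0% (UCCL faster) |
| 4 | on | 5.98 s | 3.84 s | −35.8% (UCCL faster) |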

Mean training-iteration time (lower = better) over 16 steady-state iters after warmup, 0 stalls, EFA active on every rank. At the throughput-efficient operating point (micro-batch ≥ 4) UCCL deep_ep is ~36% faster, and the advantage holds under deployment-realistic 1F1B overlap (contrary to the usual expectation that overlap compresses the dispatcher delta toward parity). NCCL wins only at micro-batch 1 (64 tiny dispatches — UCCL-EP's per-dispatch overhead unamortized), an operating point no throughput-tuned run uses.
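For readers who want to re-derive the delta column, here is a minimal sketch. It assumes the delta is the UCCL mean iteration time relative to the NCCL mean, so a negative value means UCCL is faster; the helper name is illustrative. Inputs are the rounded DeepSeek-V3 table values, so a result can differ from the published figure by about 0.1 percentage point.

```python
def dispatcher_delta_pct(nccl_s, uccl_s):
    """Relative change in mean iteration time when moving from NCCL to UCCL (percent)."""
    return (uccl_s - nccl_s) / nccl_s * 100.0


# (label, NCCL all-to-all seconds, UCCL deep_ep seconds) from the DeepSeek-V3 table.
rows = [
    ("mb=1, overlap off", 12.54, 14.12),
    ("mb=4, overlap off", 9.77, 6.26),
    ("mb=4, overlap on", 5.98, 3.84),
]

for label, nccl_s, uccl_s in rows:
    delta = dispatcher_delta_pct(nccl_s, uccl_s)
    winner = "UCCL" if delta < 0 else "NCCL"
    print(f"{label}: {delta:+.1f}% ({winner} faster)")
```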

Work-equivalence verified (no token dropping): both arms' config dumps are byte-identical except the dispatcher selector (no moe_expert_capacity_factor/moe_token_drop_policy set → drop-free by construction), and an iteration-1 loss match (deepep 11.897349 vs alltoall 11.897517, rel diff 1.4e-5 — bf16 round-off, identical num_tokens). So the −36% is a genuine equal-work speedup.
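The config-dump half of that check could look roughly like the sketch below. The dictionaries are synthetic stand-ins for the real dumps, and the key names and values (in particular `moe_token_dispatcher_type`) are assumptions for illustration; the actual dumps may encode the dispatcher selector in more than one field.

```python
def differing_keys(cfg_a, cfg_b):
    """Return the set of keys whose values differ (or exist in only one config)."""
    missing = object()  # sentinel so an absent key never equals a real value
    keys = set(cfg_a) | set(cfg_b)
    return {k for k in keys if cfg_a.get(k, missing) != cfg_b.get(k, missing)}


def is_drop_free(cfg):
    """Drop-free by construction: neither token-dropping knob is set."""
    return (cfg.get("moe_expert_capacity_factor") is None
            and cfg.get("moe_token_drop_policy") is None)


# Synthetic, abbreviated configs (illustrative values only).
base = {"seq_length": 4096, "global_batch_size": 256, "bf16": True,
        "moe_expert_capacity_factor": None}
alltoall_cfg = {**base, "moe_token_dispatcher_type": "alltoall"}
deepep_cfg = {**base, "moe_token_dispatcher_type": "flex"}

print(differing_keys(alltoall_cfg, deepep_cfg))
print(is_drop_free(alltoall_cfg) and is_drop_free(deepep_cfg))
```

An empty capacity factor and drop policy on both arms, together with a diff that contains only the dispatcher selector, is what makes the two runs equal-work before any loss value is compared.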

Benchmark result — literal Kimi-K2, 384 experts (256× B300, measured 2026-06-04)

The same A/B re-run on the literal Kimi-K2 (61 layers, 384 routed experts, top-8, MLA, no MTP), with the provider built by AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Base", trust_remote_code=True).to_megatron_provider() — same image, same TP8/PP8/EP32/DP4 layout, mock data, 24 iters/run with log_interval=1 (first 4 dropped). See kimi-k2/benchmarks/RESULTS.md.

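| micro-batch | overlap | NCCL all-to-all | UCCL `deep_ep` | dispatcher delta (UCCL vs NCCL iter time) |
|---|---|---|---|---|
| 1 | off | 12.88 s | 14.98 s | +16.3% (NCCL faster) |
| 4 | off | 9.30 s | 6.04 s | −35.1% (UCCL faster) |
| 4 | on | 5.79 s | 3.83 s | −33.8% (UCCL faster) |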

The DSV3 findings transfer to the real K2 architecture: UCCL deepep −34/−35% at mb=4 in both overlap regimes; NCCL faster only at mb=1 no-overlap. Work-equivalence held across the full run: per-iteration lm loss curves for deepep vs alltoall match within ≤4.1e-4 relative over iters 1–10 in all three cell pairs (bf16 round-off accumulation, no systematic offset). The mb=1+overlap pair was not measured — the Capacity Block expired mid-campaign after 6 of 8 cells (partial logs preserved); a 16-node TP8/PP4/EP32/DP2 appendix table is in the results doc (within-cell deltas only, not comparable across layouts).
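A per-iteration loss-curve comparison of this kind can be sketched as follows. The CSV column names (`iteration`, `lm_loss`) and the curves themselves are assumptions made up for illustration and are not measured values; only the 4.1e-4 bound comes from the results above.

```python
import csv
import io


def read_loss_curve(csv_text):
    """Parse CSV text with 'iteration' and 'lm_loss' columns into {iteration: loss}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["iteration"]): float(row["lm_loss"]) for row in reader}


def max_relative_diff(curve_a, curve_b, first_iter, last_iter):
    """Largest relative loss difference over the inclusive iteration window."""
    worst = 0.0
    for it in range(first_iter, last_iter + 1):
        a, b = curve_a[it], curve_b[it]
        worst = max(worst, abs(a - b) / max(abs(a), abs(b)))
    return worst


# Synthetic curves for illustration only.
deepep_csv = "iteration,lm_loss\n1,12.0000\n2,11.5000\n3,11.0000\n"
alltoall_csv = "iteration,lm_loss\n1,12.0012\n2,11.5000\n3,10.9989\n"

TOLERANCE = 4.1e-4  # bound reported for the Kimi-K2 cell pairs
worst = max_relative_diff(read_loss_curve(deepep_csv),
                          read_loss_curve(alltoall_csv), 1, 3)
print(f"max relative diff = {worst:.2e}; within bound: {worst <= TOLERANCE}")
```

Taking the maximum over the window, rather than a mean, is a deliberately strict choice: a single divergent iteration would fail the check even if the curves agree elsewhere.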

Testing performed

Local CI-equivalent lint: shell bash -n + copyright + set -e, Python py_compile + copyright, Dockerfile version pins (EFA 1.48.0, GDRCopy v2.5.2, CUDA 13.1) and pinned FROM, no :latest/binaries/experimental.
Both PyTorchJob manifests validated as multi-doc YAML; ConfigMap volume/mount placement checked.
test_megatron_bridge_uccl.py builds the image and asserts import deep_ep resolves to the UCCL wrapper (deep_ep.Buffer present).
Single-node 8-GPU sanity gate (2.sanity-singlenode.sh): Gates 1–4 pass (EFA device present, UCCL deep_ep active with Buffer, MCore flex/deepep config fields, NCCL all-reduce over EFA). Gate 5's hand-rolled MoEFlexTokenDispatcher micro-step is stale on Core 0.17.1 (predates the ProcessGroupCollection API) — not a UCCL/image fault; the real pretrain() path builds the process groups internally and the multi-node benchmark runs clean through MoEFlexTokenDispatcher(backend="deepep"), which is the authoritative end-to-end dispatch check.
Multi-node validation: the 256× B300 dispatcher A/B above ran end-to-end on both arms; every rank logged NET/OFI Selected provider is efa, fabric is efa-direct (found 16 nics) (true EFA RDMA, no socket fallback) and the UCCL arm logged the UCCL-EP EFA proxy active.
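The image-level import check described above (that `deep_ep` resolves to the UCCL copy rather than an NVIDIA one) can be approximated without importing the module at all. This is a hedged sketch, not the contents of test_megatron_bridge_uccl.py: the helper names are hypothetical, and treating any path outside `dist-packages` as the UCCL copy is an assumption based on the commit note about the venv copy.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def looks_like_uccl_copy(origin):
    """Heuristic (assumption): anything outside system dist-packages is the UCCL copy."""
    return origin is not None and "dist-packages" not in origin


origin = module_origin("deep_ep")
if origin is None:
    print("deep_ep is not importable in this environment")
else:
    print(f"deep_ep resolves to {origin}; UCCL copy: {looks_like_uccl_copy(origin)}")
```

On the real image a stricter test would also import the module and confirm that `deep_ep.Buffer` exists, as the build test is described as doing.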

Scope note. Both the DeepSeek-V3 256-expert and the literal 384-expert Kimi-K2 dispatcher A/Bs are validated at 256× B300 (tables above). The K2 provider comes from AutoBridge (no hand-derived routing overrides needed — K2's HF config carries n_group=1/topk_group=1 and num_nextn_predict_layers=0, and the AutoBridge-built provider preserves them). Benchmarks use random init + mock data and measure throughput + dispatcher work-equivalence; real-data SFT convergence is not benchmarked here. The K2 HF→MCore checkpoint conversion path was exercised end-to-end (1.6 TiB MCore checkpoint produced). Both models now also ship a full-parameter SFT recipe (conf/ + K8s manifests) and share one model-agnostic convert-checkpoint.sh; these SFT recipes — including the new DeepSeek-V3 one — are scaffolded and unrun (TODOs flagged inline; DeepSeek-V3 keeps its 1 MTP layer where K2 has none), distinct from the measured dispatcher A/Bs above. The shared convert-checkpoint.sh runs the AutoBridge import_ckpt weight-conversion path, which is exercised for K2 but flagged unverified for DeepSeek-V3 on this image.

Checklist

Dockerfile pins versions (EFA, GDRCopy, CUDA, UCCL commit) and pinned FROM
Scripts have copyright headers
README documents prerequisites and usage
No large binaries committed

Full-parameter SFT of Kimi K2 (1.04T MoE) on EKS with NVIDIA Megatron-Bridge, using UCCL-EP's EFA-native deep_ep drop-in for the expert-parallel all-to-all (replacing NVSHMEM/IB-bound NVIDIA DeepEP without patching Megatron-Core). Structured framework/library/model: the model-agnostic Megatron-Bridge + UCCL environment image lives at the library level (megatron-bridge/) and is shared by per-model recipes; kimi-k2/ holds the model-specific conf, manifests, and benchmarks. The SFT config is mounted at runtime via a ConfigMap, not baked in.

…mo:26.04.01)
- Migrate the shared env image base to nvcr.io/nvidia/nemo:26.04.01 (Megatron-Bridge 0.4.2 / Megatron-Core 0.17.1) to fix the B300 flex/deepep GPU allowlist; build-verify asserts deep_ep resolves to UCCL's venv copy (not NVIDIA dist-packages).
- Add the validated dispatcher A/B harness: run-ab-rawpods.sh (raw ranked Pods + headless Service + static torchrun rendezvous; no PyTorchJob CRD required) driving bench_dsv3_pretrain.py (recipe-native DSV3 256-expert substrate; single MOE_DISPATCHER toggle; overlap/VPP/recompute + forced router load-balancing + loss-probe handling).
- RESULTS: UCCL deepep is ~36% faster than NCCL all-to-all at micro-batch >= 4, and the win holds under deployment-realistic 1F1B overlap (-35.8% on, -36.0% off); NCCL wins only at mb=1 (overhead-bound). Work-equivalence verified two ways (drop-free config + iteration-1 loss match).
- Reconcile all READMEs to the validated raw-pod path; correct the now-false IB-removal and combine-backward-unproven claims; fill the EFA/UCCL log signatures.
- Remove the superseded PyTorchJob benchmark path (0.comm-microbench.sh, 1.run-ab-dispatcher.sh, kimi-k2-bench-pytorchjob.yaml-template).
- Genericize org-specific identifiers (account, cluster name, kubectl context, capacity-block / reservation IDs) to placeholders / required env overrides.

…mi-K2) The A/B substrate is the deepseek_v3 recipe with DSV3's values (256 routed experts, 128 attention heads), not Kimi-K2 (384 experts, 64 heads, ~1.04T vs ~671B). Relabel the benchmark titles, intros, and model descriptions in RESULTS.md, benchmarks/README.md, and the library README to "DeepSeek-V3 256-expert MoE", with an explicit substrate caveat that it is the architecture family Kimi-K2 belongs to but the literal 384-expert Kimi-K2 number is unrun. The Kimi-K2 SFT-recipe scaffolding (conf/, manifests, checkpoint conversion) keeps its Kimi-K2 naming — only the executed benchmark is relabeled. Also fix a mean-vs-median wording slip in the library README result note (metric of record is mean).

Kimi-K2-Base's HF config uses auto_map -> configuration_deepseek.DeepseekV3Config (custom code), so AutoBridge.from_hf_pretrained fails with "contains custom code which must be executed" unless trust_remote_code=True. Validated on the image (Bridge 0.4.2, nemo:26.04.01): with the flag, AutoBridge routes the DeepseekV3ForCausalLM architecture to DeepSeekV3Bridge and builds the literal Kimi-K2 provider (384 experts, moe_router_num_groups=1, 64 heads, MLA) from the HF config. Replaces the now-resolved TODO(validate against image) marker.

…model case
The dispatcher A/B benchmark ran on the recipe-native DeepSeek-V3 256-expert MoE, not the literal 384-expert Kimi-K2, so keeping it under kimi-k2/ conflated two independent experiments. Promote it to a sibling model test case under the shared megatron-bridge library env:
- git mv kimi-k2/benchmarks/{README,RESULTS,bench_dsv3_pretrain.py,run-ab-rawpods.sh} -> dsv3/ (flattened; the dir IS the experiment)
- library README: add dsv3 row to the Models table, redraw the layout tree
- repoint every cross-link (kimi-k2 README/conf/kubernetes README/manifest) to the new sibling location; drop the defunct benchmarks/ Files-table row
- kimi-k2 keeps the SFT recipe (conf/kimi_k2_sft.py, checkpoint conversion, manifests)
FSx paths (/fsx/kimi-k2), PVC (fsx-kimi-k2), and namespace (kimi-k2-bench) left as-is: they are live cluster artifacts, not repo structure.

…overwrite A/B harness
Adds the infrastructure to run a proper with/without-UCCL dispatcher A/B for BOTH DeepSeek-V3 and Kimi-K2, with all logs preserved on FSx (no overwrite) for retro and a per-iteration loss-curve equivalence check.
- kimi-k2/benchmarks/bench_kimi_k2_pretrain.py: builds the LITERAL Kimi-K2 provider (384 experts, 64 heads, n_group=1, MLA, mtp=0) via AutoBridge, grafts it onto the DSV3 recipe's mock-data/training scaffolding, re-derives the 61-layer pipeline layout (mtp-aware -> last stage ["loss"]), and runs pretrain(). Mirrors bench_dsv3_pretrain.py; same dispatcher A/B toggle + B300-allowlist guard + overlap handling + LOSS_PROBE.
- run-ab-rawpods.sh: promoted from dsv3/ to the library level; now MODEL-aware (dsv3|kimi-k2) and writes each run to a unique /fsx/megatron-bridge-bench/$CAMPAIGN_ID/ ...dir (rank logs, env.txt, STATUS). rank-0 refuses to clobber a completed run.
- bench/run-campaign.sh: drives the 16-run matrix (2 models x {mb1,mb4}x{ovl on,off} x {alltoall,deepep}) serially, asserts the EFA-active gate per run, frees GPUs between runs.
- bench/parse-runs.py: scrapes rank-0 logs -> per-run loss_curve.csv + campaign index.csv (mean steady-state iter time, TFLOP/s, tok/s, stalls, efa/uccl validity).
Rendered pod YAML validated; parser unit-tested on a synthetic Megatron log.
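The 16-run matrix is a plain Cartesian product, which the following sketch enumerates. It is written in Python for illustration even though the driver is a shell script, and the axis names and the run-label format are hypothetical rather than what run-campaign.sh actually emits.

```python
from itertools import product

MODELS = ("dsv3", "kimi-k2")
MICRO_BATCHES = (1, 4)
OVERLAP = ("on", "off")
DISPATCHERS = ("alltoall", "deepep")

# Every combination of model, micro-batch size, overlap setting, and dispatcher.
matrix = list(product(MODELS, MICRO_BATCHES, OVERLAP, DISPATCHERS))

for model, mb, overlap, dispatcher in matrix:
    # Hypothetical label; the real run directory naming is not shown in this PR text.
    print(f"{model}-mb{mb}-ovl{overlap}-{dispatcher}")

print(len(matrix), "runs")
```

Keeping the dispatcher as the innermost axis places the two arms of each A/B cell next to each other in the serial order, which limits how much cluster drift can separate the paired runs.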

…ecompute coupling
Resolves the two config-validity gaps that made kimi_k2_sft.py fail out-of-the-box:
1. pipeline_model_parallel_layout was left None, which errors on the uneven 61-layer/PP=8 split. Now reuses the DSV3 recipe's MTP-aware layout helper with an explicit mtp_num_layers=0 (Kimi K2 ships no MTP; the helper defaults an ABSENT attr to 1, which would corrupt the layout), plus a loud guard for unsupported (PP,VPP).
2. MOE_A2A_OVERLAP=on (the default) violated core 0.17.1 constraints: overlap requires VPP (PP>1) + recompute fully OFF, but the conf pinned vpp=None + full recompute. VPP/recompute/layout are now finalized together per regime, and delay_wgrad_compute is held OFF (the only validated setting).
Validated on the image (Bridge 0.4.2, nemo:26.04.01) via a config-build gate in a GPU-less pod: both regimes produce experts=384 mtp=0 with correct (8,1)/(8,2) 61-decoder layouts (SFT_CONF_GATE_OK). Mirrors bench_kimi_k2_pretrain.py.

…validated on real logs) Megatron prints the per-iteration training line (iteration N/M | elapsed time per iteration (ms) | lm loss | TFLOP/s/GPU) on the LAST rank (last PP stage), not rank 0 — rank-0.log only carries the Bridge 'Step Time' logger and the EFA/UCCL init lines. parse-runs.py was reading rank-0.log and would have produced zero loss curves. Now parses the highest-numbered rank-<r>.log for iteration lines and rank-0.log for EFA/UCCL signals. Validated against the preserved 2026-06-01 raw logs: reproduces the published alltoall mb4 overlap=on result exactly (mean 5.9786s vs published 5.978s, 178.62 vs 178.6 TFLOP/s/GPU, 0 stalls), and extracts the per-iteration lm-loss curve.
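The corrected scrape logic can be sketched as below. This is not parse-runs.py: the helper names, the regular expression, and the sample log line are assumptions modeled on the field names quoted in the commit message, and the numbers in the sample line are synthetic.

```python
import re

RANK_LOG = re.compile(r"rank-(\d+)\.log$")
ITER_LINE = re.compile(
    r"iteration\s+(?P<it>\d+)\s*/\s*(?P<total>\d+).*?"
    r"elapsed time per iteration \(ms\):\s*(?P<ms>[0-9.]+).*?"
    r"lm loss:\s*(?P<loss>[0-9.eE+-]+)"
)


def last_rank_log(names):
    """Pick the log of the highest-numbered rank, comparing ranks numerically."""
    ranked = [(int(m.group(1)), n) for n in names if (m := RANK_LOG.search(n))]
    if not ranked:
        raise FileNotFoundError("no rank-<r>.log files found")
    return max(ranked)[1]


def parse_iteration(line):
    """Return (iteration, seconds, lm_loss), or None for non-iteration lines."""
    m = ITER_LINE.search(line)
    if m is None:
        return None
    return int(m["it"]), float(m["ms"]) / 1000.0, float(m["loss"])


# Synthetic example line (assumed format, made-up values).
sample = (" iteration        5/      24 | elapsed time per iteration (ms): 6000.0 |"
          " lm loss: 1.150000E+01 | throughput per GPU (TFLOP/s/GPU): 100.0 |")

print(last_rank_log(["rank-0.log", "rank-99.log", "rank-255.log"]))
print(parse_iteration(sample))
```

Comparing ranks as integers matters here: a plain string sort would rank `rank-99.log` above `rank-255.log` and silently read the wrong file.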

run-ab-rawpods.sh moved to the library level and is now MODEL-aware with no-overwrite run dirs. Update every reference and run example in dsv3/README.md (+ RESULTS.md 'How to reproduce'): MODEL=dsv3 bash ../run-ab-rawpods.sh, the campaign driver and post-hoc parser rows, the new /fsx/megatron-bridge-bench/<CAMPAIGN_ID>/ log layout, and the corrected scrape guidance (per-iteration training line is printed by the LAST rank's log, not rank-0; parse with ../bench/parse-runs.py).

core's comm-overlap setup asserts mtp_num_layers is None or == 1 when overlap_moe_expert_parallel_comm is enabled — int 0 trips it ('MTP layernum only supports 1...'), which killed the K2 overlap=on canary at runtime_config_update(). None is correct on every path: the DSV3 layout helper treats None as no-MTP (['loss'] tail) and the overlap assert accepts it. Fixed in both the benchmark entrypoint and the SFT conf; gate now exercises comm_overlap.setup() too.

…PP8 + PP4 appendix)
First dispatcher A/B measured on literal Kimi-K2 (384 experts, AutoBridge provider), campaign 20260604T083049Z-uccl-ab-pp8-32n-v2 on 256x B300 (TP8/PP8/EP32/DP4):
- mb=4: UCCL deepep -33.8% iter time w/ overlap (3.834 vs 5.793 s), -35.1% w/o overlap (6.041 vs 9.303 s)
- mb=1 no-overlap: NCCL faster by 16.3% (12.88 vs 14.98 s) - same crossover regime as DSV3
- Work equivalence: per-iteration loss curves match to <=4.1e-4 relative over iters 1-10 in all three cell pairs (bf16 round-off, no offset)
- Coverage note: mb1+overlap pair and the same-campaign DSV3 re-run were lost to the Capacity Block expiry (6/8 K2 cells banked); partial logs preserved on FSx
Also: 16-node PP4 appendix table (within-cell deltas only), dsv3/RESULTS.md block-expiry note, kimi-k2 README updated to point at the literal-K2 results.

…ubdir) Move dsv3 bench entrypoint + RESULTS.md into dsv3/benchmarks/ so both models follow the same <model>/benchmarks/ convention. Update all cross-references (READMEs, conf, manifest, campaign launcher) and the top-level tree; every relative link re-verified to resolve.

…out)
dsv3 was benchmark-only; give it a full-parameter SFT half so both models expose the same two workloads in an identical layout (README, conf/, kubernetes/, benchmarks/).
- New dsv3/conf/dsv3_sft.py: DeepSeek-V3-256 full-param SFT ConfigContainer, sibling of kimi_k2_sft.py. Keeps DeepSeek-V3's 1 MTP layer (K2 has none), 256 experts (8/EP rank), built on the same deepseek_v3 recipe + flex/deepep dispatch. Scaffolded + unrun; TODOs flagged inline (matches kimi-k2's SFT).
- New dsv3/kubernetes/ PyTorchJob manifest + deploy README (dsv3-sft).
- Promote kimi-k2/1.convert-checkpoint.sh to a shared, model-agnostic convert-checkpoint.sh at the library level, parameterized by HF_MODEL_ID/HF_REVISION/FSX_ROOT/TRUST_REMOTE_CODE. DeepSeek-V3 conversion uses the dedicated deepseek_v3_bridge shipped in Bridge v0.4.2.
- dsv3/README.md: add Full-parameter SFT section; update Files table + the stale conf/kimi_k2_sft.py references (dsv3 now has its own conf).
- Top README + kimi-k2 README: shared convert script, both models = SFT + A/B.
Validated: bash -n, py_compile, manifest YAML (3 docs), all relative links resolve, copyright headers present. SFT recipes are scaffolded/unrun; the measured dispatcher A/Bs remain the validated part.

KeitaW added 3 commits May 31, 2026 12:52

KeitaW changed the title ~~Add Megatron-Bridge Kimi K2 UCCL-EP over EFA SFT test case~~ Add Megatron-Bridge Kimi K2 SFT test case + validated UCCL-EP-over-EFA dispatcher A/B (256× B300) Jun 3, 2026

KeitaW added 8 commits June 3, 2026 12:41

KeitaW changed the title ~~Add Megatron-Bridge Kimi K2 SFT test case + validated UCCL-EP-over-EFA dispatcher A/B (256× B300)~~ Add Megatron-Bridge Kimi K2 test case + UCCL-EP-over-EFA dispatcher A/B validated on DSV3 and literal K2 (256× B300) Jun 4, 2026

KeitaW marked this pull request as ready for review June 9, 2026 00:10

KeitaW changed the title ~~Add Megatron-Bridge Kimi K2 test case + UCCL-EP-over-EFA dispatcher A/B validated on DSV3 and literal K2 (256× B300)~~ Add Megatron-Bridge Kimi-K2 + DeepSeek-V3 test cases: full-parameter SFT + UCCL-EP-over-EFA dispatcher A/B (A/B validated on 256× B300) Jun 9, 2026

KeitaW requested a review from pbelevich June 10, 2026 22:09

KeitaW wants to merge 13 commits into main from feat/kimi-k2-uccl-bridge
