Add Megatron-Bridge Kimi-K2 + DeepSeek-V3 test cases: full-parameter SFT + UCCL-EP-over-EFA dispatcher A/B (A/B validated on 256× B300) #1116

Open
KeitaW wants to merge 13 commits into main from feat/kimi-k2-uccl-bridge

KeitaW (Collaborator) commented May 31, 2026

What this PR does

Adds a reproducible test case for full-parameter SFT of Kimi K2 (Moonshot AI's 1.04T-param DeepSeek-V3-family MoE) on Amazon EKS with NVIDIA Megatron-Bridge, using UCCL-EP's EFA-native deep_ep drop-in for the expert-parallel all-to-all. This replaces NVSHMEM/IB-verbs-bound NVIDIA DeepEP (which cannot run on EFA) without patching Megatron-Core — UCCL ships a top-level deep_ep shadow module so Megatron-Core's flex/deepep dispatcher transparently routes over EFA via UCCL + GDRCopy.

It also lands the first end-to-end DeepEP-on-EFA-vs-NCCL-all-to-all training A/B on B300 (see results below) — no such measurement existed in the literature — measured on two substrates: the DeepSeek-V3 256-expert recipe (2026-06-01) and the literal 384-expert Kimi-K2 built via AutoBridge (2026-06-04), with per-iteration loss-curve work-equivalence on both.

Structure

Laid out framework/library/model so future Megatron-Bridge models can reuse the environment:

  • 3.test_cases/megatron/megatron-bridge/ (model-agnostic shared env): Dockerfile, build + single-node sanity scripts, shared convert-checkpoint.sh, CI build test.
  • 3.test_cases/megatron/megatron-bridge/kimi-k2/ — Kimi K2 (384-expert): full-parameter SFT recipe (conf + K8s manifests) and the dispatcher A/B (benchmarks/).
  • 3.test_cases/megatron/megatron-bridge/dsv3/ — DeepSeek-V3 (256-expert): the same two workloads in a structurally identical layout — a full-parameter SFT recipe and the measured dispatcher A/B below.

The image bakes no model config; each model mounts its conf/ at /workspace/conf at runtime via a ConfigMap, so one image serves every model under the library.

Components

  • Framework: NVIDIA Megatron-Bridge — NGC nemo:26.04.01 (CUDA 13.1, PyTorch 2.11), which already ships Megatron-Bridge v0.4.2 / Megatron-Core 0.17.1 with the flex/deepep dispatcher. (The older 25.11.01 base shipped Bridge 0.2.0, whose flex/deepep GPU allowlist rejects p6-b300; 0.4.0+ fixes it.)
  • Models/Workloads: Kimi-K2 (1.04T MoE, 384 experts, MLA) and DeepSeek-V3 (671B MoE, 256 experts, MLA), each with full-parameter SFT and the UCCL-EP-over-EFA dispatcher A/B, in a structurally identical layout
  • Hardware target: 32× p6-b300.48xlarge (256× B300, sm_103), 16× EFA/node
  • Orchestration: Kubernetes / EKS (Kubeflow PyTorchJob, etcd rendezvous, optional KAI gang scheduling)
  • Transport: UCCL-EP over AWS EFA (deep_ep shadow, pinned commit 0dc87eb; IB verbs/NVSHMEM removed)

Benchmark result — UCCL-EP vs NCCL all-to-all (256× B300, measured 2026-06-01)

A live 32× p6-b300.48xlarge (256× B300) A/B that swaps only the Megatron-Core MoE token dispatcher — everything else (model, data, parallelism TP8/PP8/EP32/DP4, seq 4096, GBS 256, bf16, balanced routing, image) held byte-identical. Substrate is the deepseek_v3 recipe — DeepSeek-V3 256-expert MoE, the architecture family Kimi K2 belongs to but not the literal 384-expert Kimi K2 (see RESULTS.md for the substrate caveat).
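As a quick sanity check on the layout quoted above, the arithmetic can be sketched as follows. The numbers come from this description; the variable and function names are illustrative, and the assumption that each data-parallel replica processes GBS / DP samples per step is the standard Megatron accounting rather than something dumped from the run config.

```python
# Layout values quoted in the paragraph above.
TP, PP, EP, DP = 8, 8, 32, 4
NODES, GPUS_PER_NODE = 32, 8      # 32x p6-b300.48xlarge, 256 GPUs total
GLOBAL_BATCH_SIZE = 256
ROUTED_EXPERTS = 256              # DeepSeek-V3 substrate

world_size = NODES * GPUS_PER_NODE

# Dense-layer ranks must tile the whole job: TP x PP x DP == world size.
assert TP * PP * DP == world_size

# Each expert-parallel rank hosts an equal share of the routed experts.
experts_per_ep_rank = ROUTED_EXPERTS // EP


def micro_batches_per_step(micro_batch_size: int) -> int:
    """Micro-batches each data-parallel replica runs per optimizer step."""
    samples_per_replica = GLOBAL_BATCH_SIZE // DP
    return samples_per_replica // micro_batch_size


if __name__ == "__main__":
    print("world size:", world_size)
    print("experts per EP rank:", experts_per_ep_rank)
    for mb in (1, 4):
        print(f"micro-batch {mb}: {micro_batches_per_step(mb)} micro-batches/step")
```

Under these assumptions, micro-batch 1 runs four times as many dispatch rounds per step as micro-batch 4, which is the granularity the A/B varies.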

The dispatcher winner crosses over with micro-batch granularity:

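| micro-batch | overlap | NCCL all-to-all | UCCL deep_ep | dispatcher delta (UCCL vs NCCL iter time) |
|---|---|---|---|---|
| 1 | off | 12.54 s | 14.12 s | +12.6% (NCCL faster) |
| 4 | off | 9.77 s | 6.26 s | −36.0% (UCCL faster) |
| 4 | on | 5.98 s | 3.84 s | −35.8% (UCCL faster) |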

Mean training-iteration time (lower = better) over 16 steady-state iters after warmup, 0 stalls, EFA active on every rank. At the throughput-efficient operating point (micro-batch 4; larger micro-batches were not measured) UCCL deep_ep cuts iteration time by ~36%, and the advantage holds under deployment-realistic 1F1B overlap (contrary to the usual expectation that overlap compresses the dispatcher delta toward parity). NCCL wins only at micro-batch 1, where each step issues 64 tiny dispatches and UCCL-EP's per-dispatch overhead goes unamortized; no throughput-tuned run uses that operating point.
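The delta column can be recomputed from the rounded means in the table above. This is a minimal sketch, assuming the delta is defined as (UCCL − NCCL) / NCCL; because it starts from the rounded table values rather than the raw logs, the last digit may differ slightly from the published figures. The function and variable names are illustrative.

```python
def relative_delta(nccl_s: float, uccl_s: float) -> float:
    """UCCL iteration time relative to NCCL; negative means UCCL is faster."""
    return (uccl_s - nccl_s) / nccl_s


# (micro-batch, overlap, NCCL mean s, UCCL mean s), copied from the table.
ROWS = [
    (1, "off", 12.54, 14.12),
    (4, "off", 9.77, 6.26),
    (4, "on", 5.98, 3.84),
]


def summarize(rows):
    """Return (micro-batch, overlap, delta in percent, faster dispatcher)."""
    out = []
    for mb, overlap, nccl_s, uccl_s in rows:
        delta = relative_delta(nccl_s, uccl_s)
        winner = "UCCL" if delta < 0 else "NCCL"
        out.append((mb, overlap, round(100 * delta, 1), winner))
    return out


if __name__ == "__main__":
    for mb, overlap, pct, winner in summarize(ROWS):
        print(f"mb={mb} overlap={overlap}: {pct:+.1f}% ({winner} faster)")
```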

Work-equivalence verified (no token dropping) in two ways. First, both arms' config dumps are byte-identical except for the dispatcher selector, and neither moe_expert_capacity_factor nor moe_token_drop_policy is set, so dispatch is drop-free by construction. Second, the iteration-1 losses match (deepep 11.897349 vs alltoall 11.897517, rel diff 1.4e-5, consistent with bf16 round-off, with identical num_tokens). So the −36% is a genuine equal-work speedup.
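The iteration-1 loss comparison reduces to one relative-difference computation. The sketch below uses the two loss values quoted above; the tolerance constant is a hypothetical threshold chosen for illustration, not a value taken from the benchmark harness.

```python
def relative_diff(a: float, b: float) -> float:
    """Absolute difference of a and b, normalized by the magnitude of b."""
    return abs(a - b) / abs(b)


DEEPEP_LOSS = 11.897349      # iteration-1 lm loss, UCCL deep_ep arm
ALLTOALL_LOSS = 11.897517    # iteration-1 lm loss, NCCL all-to-all arm

# Hypothetical tolerance for "same work up to bf16 round-off".
TOLERANCE = 1e-4

rel = relative_diff(DEEPEP_LOSS, ALLTOALL_LOSS)

if __name__ == "__main__":
    print(f"relative diff: {rel:.1e}")
    print("within tolerance:", rel < TOLERANCE)
```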

Benchmark result — literal Kimi-K2, 384 experts (256× B300, measured 2026-06-04)

The same A/B re-run on the literal Kimi-K2 (61 layers, 384 routed experts, top-8, MLA, no MTP), with the provider built by AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Base", trust_remote_code=True).to_megatron_provider() — same image, same TP8/PP8/EP32/DP4 layout, mock data, 24 iters/run with log_interval=1 (first 4 dropped). See kimi-k2/benchmarks/RESULTS.md.

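| micro-batch | overlap | NCCL all-to-all | UCCL deep_ep | dispatcher delta (UCCL vs NCCL iter time) |
|---|---|---|---|---|
| 1 | off | 12.88 s | 14.98 s | +16.3% (NCCL faster) |
| 4 | off | 9.30 s | 6.04 s | −35.1% (UCCL faster) |
| 4 | on | 5.79 s | 3.83 s | −33.8% (UCCL faster) |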

The DSV3 findings transfer to the real K2 architecture: UCCL deepep −34/−35% at mb=4 in both overlap regimes; NCCL faster only at mb=1 no-overlap. Work-equivalence held across the full run: per-iteration lm loss curves for deepep vs alltoall match within ≤4.1e-4 relative over iters 1–10 in all three cell pairs (bf16 round-off accumulation, no systematic offset). The mb=1+overlap pair was not measured — the Capacity Block expired mid-campaign after 6 of 8 cells (partial logs preserved); a 16-node TP8/PP4/EP32/DP2 appendix table is in the results doc (within-cell deltas only, not comparable across layouts).
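The per-iteration equivalence check described above amounts to taking the maximum relative difference between two loss curves over a window of iterations. The following is a sketch only: the curves are synthetic placeholders (not measured values), and the function name and data layout are assumptions rather than the format produced by the campaign tooling.

```python
def max_relative_diff(curve_a, curve_b, iters):
    """Largest |a - b| / |b| over the given iterations.

    curve_a and curve_b map iteration number -> lm loss.
    """
    return max(abs(curve_a[i] - curve_b[i]) / abs(curve_b[i]) for i in iters)


# Synthetic curves for illustration only: a slowly decreasing loss, with the
# second arm offset by a fixed 1e-4 relative perturbation.
WINDOW = range(1, 11)  # iters 1-10, matching the window quoted above
alltoall_curve = {i: 12.0 - 0.1 * i for i in WINDOW}
deepep_curve = {i: loss * (1.0 + 1e-4) for i, loss in alltoall_curve.items()}

BOUND = 4.1e-4  # the bound reported for the measured K2 curves

worst = max_relative_diff(deepep_curve, alltoall_curve, WINDOW)

if __name__ == "__main__":
    print(f"max relative diff over iters 1-10: {worst:.2e}")
    print("within reported bound:", worst <= BOUND)
```

Taking the maximum rather than the mean keeps a single divergent iteration from being averaged away.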

Testing performed

  • Local CI-equivalent lint: shell bash -n + copyright + set -e, Python py_compile + copyright, Dockerfile version pins (EFA 1.48.0, GDRCopy v2.5.2, CUDA 13.1) and pinned FROM, no :latest/binaries/experimental.
  • Both PyTorchJob manifests validated as multi-doc YAML; ConfigMap volume/mount placement checked.
  • test_megatron_bridge_uccl.py builds the image and asserts import deep_ep resolves to the UCCL wrapper (deep_ep.Buffer present).
  • Single-node 8-GPU sanity gate (2.sanity-singlenode.sh): Gates 1–4 pass (EFA device present, UCCL deep_ep active with Buffer, MCore flex/deepep config fields, NCCL all-reduce over EFA). Gate 5's hand-rolled MoEFlexTokenDispatcher micro-step is stale on Core 0.17.1 (predates the ProcessGroupCollection API) — not a UCCL/image fault; the real pretrain() path builds the process groups internally and the multi-node benchmark runs clean through MoEFlexTokenDispatcher(backend="deepep"), which is the authoritative end-to-end dispatch check.
  • Multi-node validation: the 256× B300 dispatcher A/B above ran end-to-end on both arms; every rank logged NET/OFI Selected provider is efa, fabric is efa-direct (found 16 nics) (true EFA RDMA, no socket fallback) and the UCCL arm logged the UCCL-EP EFA proxy active.
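The EFA-active gate mentioned in the last bullet can be approximated by matching the quoted log signature on each rank's log. This is a minimal sketch: the regular expression is derived from the one line quoted above, the sample log prefixes are invented, and the real campaign scripts may check this differently.

```python
import re

# Pattern derived from the log line quoted in the bullet above.
OFI_LINE = re.compile(
    r"NET/OFI Selected provider is ([\w-]+), fabric is ([\w-]+) "
    r"\(found (\d+) nics\)"
)


def efa_active(log_text: str, expected_nics: int = 16) -> bool:
    """True if the log shows the efa provider with the expected NIC count."""
    match = OFI_LINE.search(log_text)
    if match is None:
        return False
    provider, _fabric, nics = match.groups()
    return provider == "efa" and int(nics) == expected_nics


# Sample lines; the host/rank prefixes are made up for illustration.
GOOD = ("node0:123 NCCL INFO NET/OFI Selected provider is efa, "
        "fabric is efa-direct (found 16 nics)")
BAD = ("node0:123 NCCL INFO NET/OFI Selected provider is tcp, "
       "fabric is tcp (found 1 nics)")

if __name__ == "__main__":
    print("good log passes:", efa_active(GOOD))
    print("bad log passes:", efa_active(BAD))
```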

Scope note. Both the DeepSeek-V3 256-expert and the literal 384-expert Kimi-K2 dispatcher A/Bs are validated at 256× B300 (tables above). The K2 provider comes from AutoBridge (no hand-derived routing overrides needed — K2's HF config carries n_group=1/topk_group=1 and num_nextn_predict_layers=0, and the AutoBridge-built provider preserves them). Benchmarks use random init + mock data and measure throughput + dispatcher work-equivalence; real-data SFT convergence is not benchmarked here. The K2 HF→MCore checkpoint conversion path was exercised end-to-end (1.6 TiB MCore checkpoint produced). Both models now also ship a full-parameter SFT recipe (conf/ + K8s manifests) and share one model-agnostic convert-checkpoint.sh; these SFT recipes — including the new DeepSeek-V3 one — are scaffolded and unrun (TODOs flagged inline; DeepSeek-V3 keeps its 1 MTP layer where K2 has none), distinct from the measured dispatcher A/Bs above. The shared convert-checkpoint.sh runs the AutoBridge import_ckpt weight-conversion path, which is exercised for K2 but flagged unverified for DeepSeek-V3 on this image.

Checklist

  • Dockerfile pins versions (EFA, GDRCopy, CUDA, UCCL commit) and pinned FROM
  • Scripts have copyright headers
  • README documents prerequisites and usage
  • No large binaries committed

KeitaW added 3 commits May 31, 2026 12:52
Full-parameter SFT of Kimi K2 (1.04T MoE) on EKS with NVIDIA Megatron-Bridge,
using UCCL-EP's EFA-native deep_ep drop-in for the expert-parallel all-to-all
(replacing NVSHMEM/IB-bound NVIDIA DeepEP without patching Megatron-Core).

Structured framework/library/model: the model-agnostic Megatron-Bridge + UCCL
environment image lives at the library level (megatron-bridge/) and is shared by
per-model recipes; kimi-k2/ holds the model-specific conf, manifests, and
benchmarks. The SFT config is mounted at runtime via a ConfigMap, not baked in.
…mo:26.04.01)

- Migrate the shared env image base to nvcr.io/nvidia/nemo:26.04.01 (Megatron-Bridge
  0.4.2 / Megatron-Core 0.17.1) to fix the B300 flex/deepep GPU allowlist; build-verify
  asserts deep_ep resolves to UCCL's venv copy (not NVIDIA dist-packages).
- Add the validated dispatcher A/B harness: run-ab-rawpods.sh (raw ranked Pods +
  headless Service + static torchrun rendezvous; no PyTorchJob CRD required) driving
  bench_dsv3_pretrain.py (recipe-native DSV3 256-expert substrate; single MOE_DISPATCHER
  toggle; overlap/VPP/recompute + forced router load-balancing + loss-probe handling).
- RESULTS: UCCL deepep is ~36% faster than NCCL all-to-all at micro-batch >= 4, and the
  win holds under deployment-realistic 1F1B overlap (-35.8% on, -36.0% off); NCCL wins
  only at mb=1 (overhead-bound). Work-equivalence verified two ways (drop-free config +
  iteration-1 loss match).
- Reconcile all READMEs to the validated raw-pod path; correct the now-false IB-removal
  and combine-backward-unproven claims; fill the EFA/UCCL log signatures.
- Remove the superseded PyTorchJob benchmark path (0.comm-microbench.sh,
  1.run-ab-dispatcher.sh, kimi-k2-bench-pytorchjob.yaml-template).
- Genericize org-specific identifiers (account, cluster name, kubectl context,
  capacity-block / reservation IDs) to placeholders / required env overrides.
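The build-verify step in the first bullet above checks where deep_ep resolves from. A rough sketch of such a check follows; the "dist-packages" substring heuristic and the example paths are assumptions for illustration, and the actual assertion in the image build may differ.

```python
import importlib.util
from typing import Optional


def deep_ep_origin() -> Optional[str]:
    """File path that `import deep_ep` would load, or None if not installed."""
    spec = importlib.util.find_spec("deep_ep")
    return None if spec is None else spec.origin


def looks_like_uccl_copy(origin: Optional[str]) -> bool:
    """Heuristic (assumed): the UCCL shadow lives outside dist-packages."""
    return origin is not None and "dist-packages" not in origin


# Hypothetical paths, used only to exercise the heuristic.
VENV_PATH = "/opt/venv/lib/python3.12/site-packages/deep_ep/__init__.py"
DIST_PATH = "/usr/local/lib/python3.12/dist-packages/deep_ep/__init__.py"

if __name__ == "__main__":
    origin = deep_ep_origin()
    print("deep_ep resolves to:", origin)
    print("looks like the UCCL copy:", looks_like_uccl_copy(origin))
```

On a machine without deep_ep installed, `deep_ep_origin()` returns None and the check reports False rather than raising.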
…mi-K2)

The A/B substrate is the deepseek_v3 recipe with DSV3's values (256 routed experts,
128 attention heads), not Kimi-K2 (384 experts, 64 heads, ~1.04T vs ~671B). Relabel the
benchmark titles, intros, and model descriptions in RESULTS.md, benchmarks/README.md, and
the library README to "DeepSeek-V3 256-expert MoE", with an explicit substrate caveat that
it is the architecture family Kimi-K2 belongs to but the literal 384-expert Kimi-K2 number
is unrun. The Kimi-K2 SFT-recipe scaffolding (conf/, manifests, checkpoint conversion)
keeps its Kimi-K2 naming — only the executed benchmark is relabeled. Also fix a
mean-vs-median wording slip in the library README result note (metric of record is mean).
@KeitaW KeitaW changed the title from "Add Megatron-Bridge Kimi K2 UCCL-EP over EFA SFT test case" to "Add Megatron-Bridge Kimi K2 SFT test case + validated UCCL-EP-over-EFA dispatcher A/B (256× B300)" on Jun 3, 2026
KeitaW added 8 commits June 3, 2026 12:41
Kimi-K2-Base's HF config uses auto_map -> configuration_deepseek.DeepseekV3Config
(custom code), so AutoBridge.from_hf_pretrained fails with "contains custom code
which must be executed" unless trust_remote_code=True. Validated on the image
(Bridge 0.4.2, nemo:26.04.01): with the flag, AutoBridge routes the
DeepseekV3ForCausalLM architecture to DeepSeekV3Bridge and builds the literal
Kimi-K2 provider (384 experts, moe_router_num_groups=1, 64 heads, MLA) from the
HF config. Replaces the now-resolved TODO(validate against image) marker.
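Whether a checkpoint needs trust_remote_code can be read off its HF config: an auto_map entry points the Auto classes at custom code shipped with the repo. The sketch below works on a plain dict; the sample config contains only the fields named in this commit message and is not the full Kimi-K2-Base config.json, and the helper name is illustrative.

```python
def needs_trust_remote_code(hf_config: dict) -> bool:
    """True if the HF config maps Auto classes to repo-local custom code."""
    return bool(hf_config.get("auto_map"))


# Trimmed-down illustration built from the fields named above.
KIMI_K2_LIKE = {
    "architectures": ["DeepseekV3ForCausalLM"],
    "auto_map": {"AutoConfig": "configuration_deepseek.DeepseekV3Config"},
}
# A config without auto_map resolves to classes built into transformers.
BUILTIN_LIKE = {"architectures": ["LlamaForCausalLM"]}

if __name__ == "__main__":
    print("Kimi-K2-like:", needs_trust_remote_code(KIMI_K2_LIKE))
    print("built-in-like:", needs_trust_remote_code(BUILTIN_LIKE))
    # On the image, the flag is then passed as in the PR description:
    #   AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Base",
    #                                 trust_remote_code=True)
```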
…model case

The dispatcher A/B benchmark ran on the recipe-native DeepSeek-V3 256-expert MoE,
not the literal 384-expert Kimi-K2, so keeping it under kimi-k2/ conflated two
independent experiments. Promote it to a sibling model test case under the shared
megatron-bridge library env:

- git mv kimi-k2/benchmarks/{README,RESULTS,bench_dsv3_pretrain.py,run-ab-rawpods.sh}
  -> dsv3/ (flattened; the dir IS the experiment)
- library README: add dsv3 row to the Models table, redraw the layout tree
- repoint every cross-link (kimi-k2 README/conf/kubernetes README/manifest) to the
  new sibling location; drop the defunct benchmarks/ Files-table row
- kimi-k2 keeps the SFT recipe (conf/kimi_k2_sft.py, checkpoint conversion, manifests)

FSx paths (/fsx/kimi-k2), PVC (fsx-kimi-k2), and namespace (kimi-k2-bench) left as-is:
they are live cluster artifacts, not repo structure.
…overwrite A/B harness

Adds the infrastructure to run a proper with/without-UCCL dispatcher A/B for BOTH
DeepSeek-V3 and Kimi-K2, with all logs preserved on FSx (no overwrite) for retro and
a per-iteration loss-curve equivalence check.

- kimi-k2/benchmarks/bench_kimi_k2_pretrain.py: builds the LITERAL Kimi-K2 provider
  (384 experts, 64 heads, n_group=1, MLA, mtp=0) via AutoBridge, grafts it onto the
  DSV3 recipe's mock-data/training scaffolding, re-derives the 61-layer pipeline layout
  (mtp-aware -> last stage ["loss"]), and runs pretrain(). Mirrors bench_dsv3_pretrain.py;
  same dispatcher A/B toggle + B300-allowlist guard + overlap handling + LOSS_PROBE.
- run-ab-rawpods.sh: promoted from dsv3/ to the library level; now MODEL-aware
  (dsv3|kimi-k2) and writes each run to a unique /fsx/megatron-bridge-bench/$CAMPAIGN_ID/
  ...dir (rank logs, env.txt, STATUS). rank-0 refuses to clobber a completed run.
- bench/run-campaign.sh: drives the 16-run matrix (2 models x {mb1,mb4}x{ovl on,off} x
  {alltoall,deepep}) serially, asserts the EFA-active gate per run, frees GPUs between runs.
- bench/parse-runs.py: scrapes rank-0 logs -> per-run loss_curve.csv + campaign index.csv
  (mean steady-state iter time, TFLOP/s, tok/s, stalls, efa/uccl validity).
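The 16-run matrix driven by the campaign script is a plain Cartesian product of four two-valued axes. A minimal sketch of the enumeration is below; the axis values come from the bullet above, while the tuple layout and ordering are assumptions, not the order run-campaign.sh actually uses.

```python
from itertools import product

MODELS = ("dsv3", "kimi-k2")
MICRO_BATCHES = (1, 4)
OVERLAP = ("on", "off")
DISPATCHERS = ("alltoall", "deepep")


def campaign_matrix():
    """All (model, micro-batch, overlap, dispatcher) cells, in product order."""
    return list(product(MODELS, MICRO_BATCHES, OVERLAP, DISPATCHERS))


if __name__ == "__main__":
    cells = campaign_matrix()
    print(len(cells), "runs")
    for model, mb, overlap, dispatcher in cells:
        print(f"MODEL={model} mb={mb} overlap={overlap} dispatcher={dispatcher}")
```

Putting the dispatcher on the innermost axis keeps each A/B pair adjacent in the run order, so both arms of a pair see similar cluster conditions.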

Rendered pod YAML validated; parser unit-tested on a synthetic Megatron log.
…ecompute coupling

Resolves the two config-validity gaps that made kimi_k2_sft.py fail out-of-the-box:

1. pipeline_model_parallel_layout was left None — errors on the uneven 61-layer/PP=8
   split. Now reuses the DSV3 recipe's MTP-aware layout helper with an explicit
   mtp_num_layers=0 (Kimi K2 ships no MTP; the helper defaults an ABSENT attr to 1,
   which would corrupt the layout), plus a loud guard for unsupported (PP,VPP).
2. MOE_A2A_OVERLAP=on (the default) violated core 0.17.1 constraints: overlap
   requires VPP (PP>1) + recompute fully OFF, but the conf pinned vpp=None + full
   recompute. VPP/recompute/layout are now finalized together per regime, and
   delay_wgrad_compute is held OFF (the only validated setting).

Validated on the image (Bridge 0.4.2, nemo:26.04.01) via a config-build gate in a
GPU-less pod: both regimes produce experts=384 mtp=0 with correct (8,1)/(8,2)
61-decoder layouts (SFT_CONF_GATE_OK). Mirrors bench_kimi_k2_pretrain.py.
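The coupling described in item 2 can be pictured as one function that returns a consistent set of knobs per regime. This is a sketch under stated assumptions: the Regime dataclass and finalize_regime are hypothetical names, the VPP value of 2 is inferred from the (8,2) layout mentioned above, and keeping full recompute in the non-overlap regime is an assumption about the conf, not a confirmed detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Regime:
    overlap: bool
    vpp: Optional[int]            # virtual pipeline size; None = no VPP
    recompute: Optional[str]      # None = recompute fully off
    delay_wgrad_compute: bool = False  # held off in both regimes


def finalize_regime(overlap: bool, pp: int = 8) -> Regime:
    """Pick VPP and recompute together so the overlap constraints hold."""
    if overlap:
        if pp <= 1:
            raise ValueError("MoE a2a overlap needs VPP, which needs PP > 1")
        # Overlap regime: VPP on, recompute fully off.
        return Regime(overlap=True, vpp=2, recompute=None)
    # Non-overlap regime (assumed): no VPP, full recompute retained.
    return Regime(overlap=False, vpp=None, recompute="full")


if __name__ == "__main__":
    for flag in (True, False):
        print(finalize_regime(flag))
```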
…validated on real logs)

Megatron prints the per-iteration training line (iteration N/M | elapsed time per
iteration (ms) | lm loss | TFLOP/s/GPU) on the LAST rank (last PP stage), not rank 0 —
rank-0.log only carries the Bridge 'Step Time' logger and the EFA/UCCL init lines.
parse-runs.py was reading rank-0.log and would have produced zero loss curves.

Now parses the highest-numbered rank-<r>.log for iteration lines and rank-0.log for
EFA/UCCL signals. Validated against the preserved 2026-06-01 raw logs: reproduces the
published alltoall mb4 overlap=on result exactly (mean 5.9786s vs published 5.978s,
178.62 vs 178.6 TFLOP/s/GPU, 0 stalls), and extracts the per-iteration lm-loss curve.
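The two parsing steps described above (pick the highest-numbered rank log, then scrape iteration lines from it) can be sketched as follows. The sample line is synthetic and its exact field layout is an assumption based on the fields named in this commit message; the function names are illustrative and not those in parse-runs.py.

```python
import re

RANK_LOG = re.compile(r"rank-(\d+)\.log$")
ITER_LINE = re.compile(
    r"iteration\s+(\d+)/\s*\d+.*?"
    r"elapsed time per iteration \(ms\):\s*([\d.]+).*?"
    r"lm loss:\s*([\d.eE+-]+)"
)


def last_rank_log(filenames):
    """Name of the highest-numbered rank-<r>.log (numeric, not lexical)."""
    ranked = [(int(m.group(1)), name)
              for name in filenames
              if (m := RANK_LOG.search(name))]
    return max(ranked)[1]


def parse_iteration(line):
    """Return (iteration, seconds per iteration, lm loss) or None."""
    m = ITER_LINE.search(line)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)) / 1000.0, float(m.group(3))


# Synthetic line for illustration; real Megatron lines carry more fields.
SAMPLE = (" iteration       10/      24 | "
          "elapsed time per iteration (ms): 5978.6 | "
          "lm loss: 1.189735E+01 |")

if __name__ == "__main__":
    print(last_rank_log(["rank-0.log", "rank-9.log", "rank-255.log"]))
    print(parse_iteration(SAMPLE))
```

Sorting on the integer rank matters: a lexical sort would place rank-9.log after rank-255.log.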
run-ab-rawpods.sh moved to the library level and is now MODEL-aware with no-overwrite
run dirs. Update every reference and run example in dsv3/README.md (+ RESULTS.md
'How to reproduce'): MODEL=dsv3 bash ../run-ab-rawpods.sh, the campaign driver and
post-hoc parser rows, the new /fsx/megatron-bridge-bench/<CAMPAIGN_ID>/ log layout,
and the corrected scrape guidance (per-iteration training line is printed by the
LAST rank's log, not rank-0; parse with ../bench/parse-runs.py).
core's comm-overlap setup asserts mtp_num_layers is None or == 1 when
overlap_moe_expert_parallel_comm is enabled — int 0 trips it ('MTP layernum only
supports 1...'), which killed the K2 overlap=on canary at runtime_config_update().
None is correct on every path: the DSV3 layout helper treats None as no-MTP
(['loss'] tail) and the overlap assert accepts it. Fixed in both the benchmark
entrypoint and the SFT conf; gate now exercises comm_overlap.setup() too.
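The fix amounts to normalizing "no MTP" to None before the config reaches the overlap setup. The sketch below mirrors the constraint as stated in this commit message; both helper names are hypothetical and neither is a Megatron-Core function.

```python
from typing import Optional


def normalize_mtp_num_layers(value: Optional[int]) -> Optional[int]:
    """Represent 'no MTP' as None instead of the integer 0."""
    return None if value is None or value == 0 else value


def overlap_accepts(mtp_num_layers: Optional[int]) -> bool:
    """Constraint as described above: None or exactly 1 is accepted."""
    return mtp_num_layers is None or mtp_num_layers == 1


if __name__ == "__main__":
    for raw in (None, 0, 1):
        fixed = normalize_mtp_num_layers(raw)
        print(f"raw={raw!r} -> {fixed!r}, "
              f"accepted before={overlap_accepts(raw)}, "
              f"after={overlap_accepts(fixed)}")
```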
…PP8 + PP4 appendix)

First dispatcher A/B measured on literal Kimi-K2 (384 experts, AutoBridge
provider), campaign 20260604T083049Z-uccl-ab-pp8-32n-v2 on 256x B300
(TP8/PP8/EP32/DP4):

- mb=4: UCCL deepep -33.8% iter time w/ overlap (3.834 vs 5.793 s),
  -35.1% w/o overlap (6.041 vs 9.303 s)
- mb=1 no-overlap: NCCL faster by 16.3% (12.88 vs 14.98 s) - same
  crossover regime as DSV3
- Work equivalence: per-iteration loss curves match to <=4.1e-4 relative
  over iters 1-10 in all three cell pairs (bf16 round-off, no offset)
- Coverage note: mb1+overlap pair and the same-campaign DSV3 re-run were
  lost to the Capacity Block expiry (6/8 K2 cells banked); partial logs
  preserved on FSx

Also: 16-node PP4 appendix table (within-cell deltas only), dsv3/RESULTS.md
block-expiry note, kimi-k2 README updated to point at the literal-K2 results.
@KeitaW KeitaW changed the title from "Add Megatron-Bridge Kimi K2 SFT test case + validated UCCL-EP-over-EFA dispatcher A/B (256× B300)" to "Add Megatron-Bridge Kimi K2 test case + UCCL-EP-over-EFA dispatcher A/B validated on DSV3 and literal K2 (256× B300)" on Jun 4, 2026
…ubdir)

Move dsv3 bench entrypoint + RESULTS.md into dsv3/benchmarks/ so both
models follow the same <model>/benchmarks/ convention. Update all
cross-references (READMEs, conf, manifest, campaign launcher) and the
top-level tree; every relative link re-verified to resolve.
@KeitaW KeitaW marked this pull request as ready for review June 9, 2026 00:10
…out)

dsv3 was benchmark-only; give it a full-parameter SFT half so both models
expose the same two workloads in an identical layout (README, conf/,
kubernetes/, benchmarks/).

- New dsv3/conf/dsv3_sft.py: DeepSeek-V3-256 full-param SFT ConfigContainer,
  sibling of kimi_k2_sft.py. Keeps DeepSeek-V3's 1 MTP layer (K2 has none),
  256 experts (8/EP rank), built on the same deepseek_v3 recipe + flex/deepep
  dispatch. Scaffolded + unrun; TODOs flagged inline (matches kimi-k2's SFT).
- New dsv3/kubernetes/ PyTorchJob manifest + deploy README (dsv3-sft).
- Promote kimi-k2/1.convert-checkpoint.sh to a shared, model-agnostic
  convert-checkpoint.sh at the library level, parameterized by
  HF_MODEL_ID/HF_REVISION/FSX_ROOT/TRUST_REMOTE_CODE. DeepSeek-V3 conversion
  uses the dedicated deepseek_v3_bridge shipped in Bridge v0.4.2.
- dsv3/README.md: add Full-parameter SFT section; update Files table + the
  stale conf/kimi_k2_sft.py references (dsv3 now has its own conf).
- Top README + kimi-k2 README: shared convert script, both models = SFT + A/B.

Validated: bash -n, py_compile, manifest YAML (3 docs), all relative links
resolve, copyright headers present. SFT recipes are scaffolded/unrun; the
measured dispatcher A/Bs remain the validated part.
@KeitaW KeitaW changed the title from "Add Megatron-Bridge Kimi K2 test case + UCCL-EP-over-EFA dispatcher A/B validated on DSV3 and literal K2 (256× B300)" to "Add Megatron-Bridge Kimi-K2 + DeepSeek-V3 test cases: full-parameter SFT + UCCL-EP-over-EFA dispatcher A/B (A/B validated on 256× B300)" on Jun 9, 2026
@KeitaW KeitaW requested a review from pbelevich June 10, 2026 22:09