[Feature] [Tracking] End-to-end bring-up of DeepSeek-V3.2 distributed inference with DP2+TP8+EP #156

@bumble0918

Summary

Track the end-to-end bring-up of DeepSeek-V3.2 distributed inference on 16× Ascend NPU cards using a DP2 + TP8 + EP16 parallelism strategy. The work extends the existing single-layer forward computation (both prefill and decode) into a complete distributed inference pipeline, covering HCCL-based inter-NPU communication, MLA-aware KV cache, W8A8 INT8 quantization, and a continuous-batching scheduler.

This issue covers Phase 1: functional correctness on a 16-card single-cluster setup with greedy decoding. Serving-scale optimizations (PagedAttention, speculative decoding, disaggregated prefill) are tracked separately.


Hardware & Software Environment

| Item | Detail |
| --- | --- |
| NPU | Ascend card × 16 (2 nodes × 8 cards per node) |
| Precision | No native FP8 support; primary inference precision is BF16 / W8A8 INT8 |
| Intra-node interconnect | HCCS (scale-up) |
| Inter-node interconnect | RoCE v2 RDMA (scale-out) |
| Communication library | HCCL (Huawei Collective Communication Library) |
| Kernel framework | PyPTO (this repo) |

Topology

Total NPUs: 16 (2 nodes × 8 cards each)

  DP size:     2  (cards 0–7 = DP rank 0,  cards 8–15 = DP rank 1)
  TP size:     8  (one TP rank per card within each DP group)
  EP size:    16  (all cards; expert layers form a single AllToAll group)

Expert sharding:
  num_experts = 256 (routed),  top-k = 8
  num_local_experts = 256 / 16 = 16 experts per card

Process groups (HCCL backend):
  tp_group:  {0..7}  and  {8..15}
  dp_group:  {0,8}, {1,9}, ..., {7,15}
  ep_group:  {0..15}

Intra-node comms (TP AllReduce):  HCCS — fast, low latency
Inter-node comms (EP AllToAll, DP coord):  RoCE v2 — higher latency;
                                            overlap strategy needed
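
A minimal sketch of how the three process-group layouts above could be derived from the world size, assuming global ranks are numbered contiguously per node. `build_groups` is a hypothetical helper for illustration, not an existing PyPTO or HCCL API.

```python
def build_groups(world_size=16, tp_size=8):
    """Return (tp_groups, dp_groups, ep_group) as lists of global ranks."""
    assert world_size % tp_size == 0, "world size must be a multiple of TP size"
    dp_size = world_size // tp_size
    # One TP group per DP replica: a contiguous block of tp_size ranks (one node).
    tp_groups = [list(range(d * tp_size, (d + 1) * tp_size)) for d in range(dp_size)]
    # One DP group per TP rank: the ranks holding the same TP shard in each replica.
    dp_groups = [[d * tp_size + t for d in range(dp_size)] for t in range(tp_size)]
    # A single EP group spanning every card.
    ep_group = list(range(world_size))
    return tp_groups, dp_groups, ep_group


tp_groups, dp_groups, ep_group = build_groups()
print(tp_groups[1])   # [8, 9, 10, 11, 12, 13, 14, 15]
print(dp_groups[7])   # [7, 15]
print(len(ep_group))  # 16
```

With this layout every TP group stays inside one node (HCCS), while each DP group and the EP group cross the node boundary (RoCE v2).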

Motivation / Use Case

The repo currently only ships single-layer forward kernels for prefill and decode. There is no end-to-end distributed pipeline that can:

  1. Correctly route and shard a 671B MoE model across 16 Ascend NPU cards;
  2. Serve as a reference implementation for DP+TP+EP combined parallelism on the PyPTO stack;
  3. Demonstrate the full integration story: PyPTO kernel + HCCL communication + MLA KV cache + W8A8 quantization + scheduler + tokenizer.

DeepSeek-V3.2 is chosen as the first target because:

  • Its MoE architecture (256 routed experts, top-8 routing) is naturally EP-friendly and suits Ascend's inter-card communication profile: MoE requires less AllReduce traffic per layer than a dense model;
  • DeepSeek uses MLA (Multi-head Latent Attention), which significantly reduces KV cache memory pressure;
  • W8A8 INT8 is the natural precision target on Ascend, with hardware-level co-design enabling near-BF16 accuracy.

Proposed Components / Scope

A. Communication Layer (HCCL-based)

All collectives use the HCCL backend. Three distinct communication patterns are needed, each using a different process group:

  • TP AllReduce — within each 8-card TP group, after attention output projection and MLP row-parallel linear; runs over HCCS intra-node
  • EP AllToAll dispatch/combine — across all 16 cards for MoE layers:
    • Pre-communication of per-rank send counts so each rank knows recv sizes before the data AllToAll
    • Variable-length AllToAllv: split tensors dispatched via HCCL EP group
    • Token reordering/packing by destination rank before dispatch
    • EP AllToAll crosses the node boundary over RoCE v2 — latency is higher than intra-node HCCS; overlap with shared-expert compute is desirable
  • DP synchronization coordinator — dummy forward pass injection for idle DP ranks; both DP groups must enter every HCCL collective together or all ranks hang
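
The EP dispatch steps listed above (pack by destination, exchange send counts, then exchange data) can be sketched as a single-process simulation. This is only an illustration of the data movement: all function names are hypothetical, the "ranks" are list indices, and the real implementation would issue HCCL collectives instead of Python loops.

```python
def pack_by_destination(tokens, dest_ranks, ep_size):
    """Reorder tokens so that entries bound for the same rank are contiguous."""
    order = sorted(range(len(tokens)), key=lambda i: dest_ranks[i])  # stable sort
    packed = [tokens[i] for i in order]
    send_counts = [0] * ep_size
    for r in dest_ranks:
        send_counts[r] += 1
    return packed, send_counts, order  # `order` is needed to undo the permutation


def simulate_all_to_all_v(packed_per_rank, send_counts_per_rank):
    """Simulate the count pre-exchange followed by the variable-length exchange."""
    ep_size = len(packed_per_rank)
    # Step 1: count exchange. recv_counts[dst][src] == send_counts[src][dst].
    recv_counts = [[send_counts_per_rank[src][dst] for src in range(ep_size)]
                   for dst in range(ep_size)]
    # Step 2: data exchange, slicing each sender's packed buffer by its counts.
    recv = [[] for _ in range(ep_size)]
    for dst in range(ep_size):
        for src in range(ep_size):
            start = sum(send_counts_per_rank[src][:dst])
            count = send_counts_per_rank[src][dst]
            recv[dst].extend(packed_per_rank[src][start:start + count])
    return recv, recv_counts


# Two simulated ranks: rank 0 holds tokens a, b, c; rank 1 holds d, e.
p0, c0, _ = pack_by_destination(["a", "b", "c"], [1, 0, 1], ep_size=2)
p1, c1, _ = pack_by_destination(["d", "e"], [0, 1], ep_size=2)
recv, recv_counts = simulate_all_to_all_v([p0, p1], [c0, c1])
print(recv)         # [['b', 'd'], ['a', 'c', 'e']]
print(recv_counts)  # [[1, 1], [2, 1]]
```

Because each rank learns `recv_counts` before the data exchange, it can size its receive buffer exactly; the combine step would run the same exchange in reverse and apply the inverse of `order`.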

Files to add under distributed/:

  • hccl_tp_comm.py — TP AllReduce helpers
  • hccl_ep_dispatch.py — AllToAllv with send-count pre-exchange
  • dp_coord.py — DP idle-rank coordinator
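
As a rough illustration of what the DP idle-rank coordinator has to decide each step, the sketch below maps per-DP-rank pending work to an action. `plan_step` is a hypothetical name, and the sketch assumes the pending counts have already been shared between DP ranks by some other mechanism (for example a small collective).

```python
def plan_step(pending_per_dp_rank):
    """Return one action per DP rank, or None if every rank is idle.

    'real'  -> run the forward pass on actual requests
    'dummy' -> run a dummy forward pass so the rank still joins every collective
    """
    if not any(pending_per_dp_rank):
        return None  # nothing to do anywhere: no rank enters a collective
    return ["real" if n > 0 else "dummy" for n in pending_per_dp_rank]


print(plan_step([3, 0]))  # ['real', 'dummy']
print(plan_step([0, 0]))  # None
```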

B. MoE Router + Expert Execution

  • Router (gating): runs on all ranks; outputs [num_tokens, 256] logits and selects the top-8 experts per token
  • Token dispatch: reorder hidden states by destination EP rank, then call hccl_ep_dispatch.dispatch()
  • Grouped GEMM: per-card computation over the 16 locally owned experts with variable batch sizes, implemented via PyPTO grouped matmul kernels; must handle zero-token experts gracefully
  • Result combine: hccl_ep_dispatch.combine(), then a weighted sum of the top-k expert outputs per token
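
The router, per-expert grouping, and weighted combine can be sketched in plain Python as follows. This is a single-card toy under stated assumptions: hidden states are scalars, experts are callables, and the gate weights are a softmax over the selected logits, which may differ from the reference model's gating function. All names are hypothetical.

```python
import math


def route_top_k(logits, top_k):
    """Return [(expert_id, weight), ...] for one token, weights summing to 1."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]  # softmax over selected experts (assumption)
    z = sum(exps)
    return [(e, w / z) for e, w in zip(top, exps)]


def moe_forward(tokens, router_logits, experts, top_k):
    """Group tokens per expert, run each non-empty group, and combine by gate weight."""
    buckets = [[] for _ in experts]  # experts with no tokens keep an empty bucket
    for t, logits in enumerate(router_logits):
        for e, w in route_top_k(logits, top_k):
            buckets[e].append((t, w))
    out = [0.0] * len(tokens)
    for e, bucket in enumerate(buckets):
        if not bucket:
            continue  # zero-token expert: skip instead of launching an empty GEMM
        for t, w in bucket:
            out[t] += w * experts[e](tokens[t])
    return out, [len(b) for b in buckets]


experts = [lambda x, k=k: k * x for k in (1, 2, 3, 4)]  # toy experts: scale by 1..4
out, tokens_per_expert = moe_forward(
    tokens=[1.0, 2.0],
    router_logits=[[0.0, 0.0, 5.0, 5.0], [9.0, 0.0, 0.0, 9.0]],
    experts=experts,
    top_k=2,
)
print(out)                # [3.5, 5.0]
print(tokens_per_expert)  # [1, 0, 1, 2]
```

Expert 1 receives no tokens in this example, which is the case the grouped GEMM kernel has to tolerate.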

Files to add under models/deepseek/:

  • moe_router.py
  • moe_layer.py — full router + dispatch + grouped GEMM + combine

C. KV Cache (MLA-aware)

DeepSeek uses MLA with compressed KV latents, which significantly reduces per-card memory pressure:

  • Cache the KV latent c_kv of shape [num_layers, max_seq, kv_lora_rank=512] per card, instead of full K/V tensors — roughly 32× smaller than standard MHA cache
  • Re-project latent → K, V at decode time on the fly using PyPTO matmul kernels
  • TP sharding: KV latent replicated across the TP group (small enough to be practical)
  • Attention kernel: use PyPTO prefill attention for prefill; a dedicated low-latency decode attention kernel for the decode path — the generic attention path has known poor small-batch decode performance on Ascend
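
The cache-size argument can be checked with a little arithmetic. The head count (128), head dimension (128), layer count (61), and `max_seq` below are assumptions for illustration, not values taken from this issue; `rope_dim` is an optional extra for a decoupled RoPE key if one is cached alongside the latent.

```python
def kv_elems_per_token_mha(num_heads, head_dim):
    """Elements cached per token per layer for standard MHA (K and V)."""
    return 2 * num_heads * head_dim


def kv_elems_per_token_mla(kv_lora_rank=512, rope_dim=0):
    """Elements cached per token per layer when only the MLA latent is stored."""
    return kv_lora_rank + rope_dim


def cache_bytes(num_layers, max_seq, elems_per_token, bytes_per_elem=2):
    """Total cache size for one sequence slot, BF16 by default."""
    return num_layers * max_seq * elems_per_token * bytes_per_elem


mha = kv_elems_per_token_mha(num_heads=128, head_dim=128)  # assumed dimensions
mla = kv_elems_per_token_mla()
print(mha // mla)        # 64  (K and V together vs. the latent)
print(mha // 2 // mla)   # 32  (K or V alone vs. the latent)
print(cache_bytes(61, 4096, mla) / 2**20)  # 244.0 MiB per 4096-token slot
```

Under these assumed dimensions the latent is 64× smaller than K and V combined, or 32× smaller than either one alone.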

Files to add under cache/:

  • mla_kv_cache.py — MLA latent allocation per card
  • rope.py — RoPE cos/sin table (theta=10000, YaRN optional)
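
A minimal sketch of the RoPE cos/sin table with theta=10000, without YaRN scaling. The interleaved pairing `(x[2i], x[2i+1])` used in `apply_rope` is an assumption; the layout actually required should be checked against the reference model, and both function names are hypothetical.

```python
import math


def rope_tables(max_seq, rot_dim, theta=10000.0):
    """Return (cos, sin), each of shape [max_seq][rot_dim // 2]."""
    assert rot_dim % 2 == 0, "rotary dimension must be even"
    inv_freq = [theta ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    cos = [[math.cos(pos * f) for f in inv_freq] for pos in range(max_seq)]
    sin = [[math.sin(pos * f) for f in inv_freq] for pos in range(max_seq)]
    return cos, sin


def apply_rope(x, cos_row, sin_row):
    """Rotate consecutive pairs of x by the angles for one position."""
    out = []
    for i, (c, s) in enumerate(zip(cos_row, sin_row)):
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


cos, sin = rope_tables(max_seq=8, rot_dim=4)
x = [1.0, 2.0, 3.0, 4.0]
print(apply_rope(x, cos[0], sin[0]))  # [1.0, 2.0, 3.0, 4.0] (position 0 is the identity)
rotated = apply_rope(x, cos[5], sin[5])  # same vector norm as x, different direction
```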

D. Quantization: W8A8 INT8

Hardware note: Ascend cards do not support FP8 natively. The Cube compute engine is co-designed for W8A8 INT8, with a parallel Vector Unit performing dynamic activation scaling in real time to maintain near-BF16 accuracy. This replaces the FP8/DeepGEMM strategy used on NVIDIA GPUs.

  • Weight quantization (W8): expert and attention weights stored as INT8 with per-channel scales; each EP rank loads only its 16 experts' INT8 weights at init time — never load all 256 and discard
  • Activation quantization (A8): dynamic per-token INT8 quantization before each Cube GEMM; compute amax, scale, quantize online via PyPTO Vector-unit kernels
  • INT8 grouped GEMM: expert compute via PyPTO's INT8 grouped matmul kernel backed by the Cube engine
  • KV cache quantization (optional): INT8 KV cache for MLA latent to further reduce HBM pressure; dequantize before attention
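
The W8A8 scheme described above (per-channel weight scales, dynamic per-token activation scales, integer accumulation, then dequantization) can be sketched in plain Python. This is a numerical reference only, assuming symmetric quantization to the range [-127, 127]; the function names are hypothetical and the real kernels would run on the Cube and Vector units.

```python
def quantize_symmetric(row):
    """Quantize one row to INT8 with a single scale: amax -> scale -> round -> clamp."""
    amax = max(abs(v) for v in row)
    scale = amax / 127.0 if amax > 0 else 1.0  # avoid dividing by zero on all-zero rows
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def quantize_weight_per_channel(w):
    """w is [out_features][in_features]; one scale per output channel (offline step)."""
    return [quantize_symmetric(row) for row in w]


def w8a8_linear(x_row, w_q):
    """Dynamic per-token activation quant, INT8 dot products, then dequantize."""
    xq, x_scale = quantize_symmetric(x_row)
    out = []
    for wq, w_scale in w_q:
        acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
        out.append(acc * x_scale * w_scale)
    return out


x = [1.0, -2.0, 0.5]
w = [[0.5, 0.25, -1.0], [1.0, 1.0, 1.0]]
reference = [sum(a * b for a, b in zip(x, row)) for row in w]  # [-0.5, -0.5]
approx = w8a8_linear(x, quantize_weight_per_channel(w))
print(reference, approx)
```

Comparing `approx` against `reference` on real layer shapes is the same check as the per-layer cos_sim gate, just on a single linear.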

Files to add under quantization/:

  • w8a8_linear.py — dynamic per-token INT8 activation quant + INT8 matmul
  • w8a8_grouped_gemm.py — INT8 grouped matmul wrapper for MoE experts
  • weight_loader.py — EP-sliced W8A8 checkpoint loader (INT8 safetensors format)
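
One way the EP-sliced loader could decide which tensors to read is to filter checkpoint keys by expert index before touching any data. The key pattern `...experts.<id>...` and the contiguous expert-to-rank assignment are assumptions for illustration; `keys_for_rank` is a hypothetical helper.

```python
import re

_EXPERT_KEY = re.compile(r"\.experts\.(\d+)\.")


def keys_for_rank(all_keys, ep_rank, num_experts=256, ep_size=16):
    """Keep non-expert tensors plus the experts owned by this EP rank."""
    per_rank = num_experts // ep_size
    lo, hi = ep_rank * per_rank, (ep_rank + 1) * per_rank
    keep = []
    for key in all_keys:
        m = _EXPERT_KEY.search(key)
        if m is None or lo <= int(m.group(1)) < hi:
            keep.append(key)  # non-expert keys pass through; TP sharding is handled elsewhere
    return keep


keys = [
    "model.layers.3.mlp.experts.0.gate_proj.weight",
    "model.layers.3.mlp.experts.16.gate_proj.weight",
    "model.layers.3.mlp.experts.31.gate_proj.weight",
    "model.layers.3.mlp.experts.32.gate_proj.weight",
    "model.layers.3.self_attn.o_proj.weight",
]
print(keys_for_rank(keys, ep_rank=1))
# keeps experts 16 and 31 plus the attention weight
```

Filtering on keys first means only the selected tensors are ever materialized, which is what avoids loading all 256 experts and discarding most of them.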

E. Scheduler + Generation Loop

  • Maintains request queue with per-sequence KV cache slots and sequence lengths
  • Implements continuous batching: mix prefill and decode sequences in the same forward pass
  • Prefill path: large token batches, compute-bound on Cube INT8 GEMMs
  • Decode path: 1 token per active sequence, HBM-bandwidth bound; EP AllToAll over RoCE v2 is the latency bottleneck — consider overlapping with shared-expert (non-EP) computation
  • DP load balancer: distributes incoming requests evenly across DP2 ranks; idle DP ranks still incur AllToAll synchronization cost so imbalance is expensive
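
A minimal sketch of batch construction for continuous batching, assuming a simple per-step token budget and whole-prompt prefill (no chunking). KV slot allocation is not modeled, and `Request` and `build_batch` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    prompt_len: int
    prefilled: bool = False


def build_batch(waiting, running, token_budget):
    """Return [(req_id, num_tokens), ...] mixing decode and prefill work for one step."""
    batch = [(r.req_id, 1) for r in running]  # one decode token per running sequence
    used = len(batch)
    # Admit waiting prompts in FIFO order while they fit in the remaining budget.
    while waiting and used + waiting[0].prompt_len <= token_budget:
        req = waiting.popleft()
        batch.append((req.req_id, req.prompt_len))
        used += req.prompt_len
        req.prefilled = True
        running.append(req)  # decodes from the next step onward
    return batch


waiting = deque([Request(1, 5), Request(2, 10)])
running = [Request(0, 3, prefilled=True)]
print(build_batch(waiting, running, token_budget=8))  # [(0, 1), (1, 5)]
print(build_batch(waiting, running, token_budget=8))  # [(0, 1), (1, 1)]
```

Request 2 does not fit in either step here; a real scheduler would need chunked prefill or a larger budget to avoid starving long prompts.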

Files to add under engine/:

  • scheduler.py — request queue, KV slot tracking, batch construction
  • generation.py — prefill → decode → sampling loop; tokenizer + chat template
  • dp_load_balancer.py — DP2 request distribution
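
For the DP2 request distribution, a least-loaded policy is one simple option. The sketch below assumes load is tracked as an outstanding token count per DP rank; `assign_request` is a hypothetical name and other policies (round-robin, KV-slot aware) may fit better in practice.

```python
def assign_request(load_per_dp_rank, cost):
    """Send the request to the DP rank with the least outstanding work; ties go to the lowest rank."""
    rank = min(range(len(load_per_dp_rank)), key=lambda r: load_per_dp_rank[r])
    load_per_dp_rank[rank] += cost
    return rank


loads = [0, 0]
print([assign_request(loads, c) for c in (10, 4, 4, 4)])  # [0, 1, 1, 1]
print(loads)                                              # [10, 12]
```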

Implementation Plan

| Phase | Task | Est. effort |
| --- | --- | --- |
| A. Communication | HCCL TP AllReduce, EP AllToAllv with send-count pre-exchange, DP sync coordinator, stream budget analysis | 3–4 days |
| B. MoE Layer | Router, token dispatch/combine, grouped GEMM (BF16 baseline first) | 2–3 days |
| C. KV Cache | MLA latent cache, decode attention kernel integration, RoPE tables | 2 days |
| D. W8A8 Quantization | INT8 weight/activation quant, INT8 grouped GEMM, EP-sliced W8A8 weight loader | 3–4 days |
| E. Scheduler | Continuous batching, prefill/decode paths, DP load balancing, RoCE latency overlap | 2–3 days |
| F. Accuracy Validation | Per-layer cos_sim ≥ 0.999 vs HF reference; greedy 20-step token match | 2 days |
| G. E2E Demo | Full prompt → text generation on 16 NPUs, readable output, throughput log | (A–F done) |

Milestone: ~2.5–3 weeks to first functional end-to-end greedy decoding demo.


Sub-tasks

  • Land design doc examples/models/deepseek_671b/e2e_guide_ascend.md
  • Phase A — Communication
    • distributed/hccl_tp_comm.py
    • distributed/hccl_ep_dispatch.py — AllToAllv with send-count pre-exchange
    • distributed/dp_coord.py
    • HCCL stream budget analysis for TP=8 + EP=16 combined
    • Unit test: AllToAll correctness on 16-card dummy tensors; validate RoCE v2 inter-node path
  • Phase B — MoE Layer
    • models/deepseek/moe_router.py
    • models/deepseek/moe_layer.py — BF16 baseline
    • Single MoE layer accuracy vs HF reference (cos_sim ≥ 0.999)
  • Phase C — KV Cache
    • cache/mla_kv_cache.py
    • cache/rope.py
    • Validate MLA re-projection matches HF per-step output
  • Phase D — W8A8 Quantization
    • quantization/w8a8_linear.py
    • quantization/w8a8_grouped_gemm.py
    • quantization/weight_loader.py
    • W8A8 vs BF16 accuracy comparison (cos_sim ≥ 0.99 per expert layer)
  • Phase E — Scheduler
    • engine/scheduler.py
    • engine/generation.py
    • engine/dp_load_balancer.py
    • Profile EP AllToAll latency over RoCE; assess overlap with shared expert compute
  • Phase F — Validation
    • Per-layer cos_sim ≥ 0.999 vs HF (sampled layers 0, 10, 20, 30, 60)
    • Final hidden cos_sim ≥ 0.98
    • Greedy 20-step token ID match with HF transformers
  • Phase G — Demo
    • End-to-end Chinese/English Q&A on 16 NPUs with readable output
    • Prefill and decode throughput numbers logged (tokens/s per card)
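
The two validation gates used throughout the sub-tasks (cosine similarity of hidden states and greedy token ID match) can be expressed with two small helpers. This is a plain-Python sketch on flattened lists; both function names are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two flattened tensors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (na * nb)


def first_mismatch(ours, ref):
    """Index of the first differing token ID, or None if the sequences match."""
    for i, (x, y) in enumerate(zip(ours, ref)):
        if x != y:
            return i
    return None if len(ours) == len(ref) else min(len(ours), len(ref))


print(cos_sim([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) >= 0.999)  # True
print(first_mismatch([11, 42, 7], [11, 42, 9]))             # 2
```

Reporting the index of the first mismatch, rather than a pass/fail flag, makes it easier to tell an early divergence (likely a wiring bug) from a late one (likely accumulated numerical error).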

Alternatives Considered

  1. FP8 quantization (as used on H100/H200): not applicable — Ascend cards have no native FP8 hardware support. W8A8 INT8 is the correct and hardware-optimized precision target.
  2. BF16 throughout, skip quantization: 671B parameters in BF16 need ~1.3 TB for weights alone (671 × 10⁹ parameters × 2 bytes), which far exceeds the 16-card memory ceiling even before activations or KV cache are counted. W8A8 halves weight memory to ~0.67 TB, making the deployment feasible.
  3. Use DeepGEMM or CUTLASS for grouped GEMM: both are CUDA-only. The equivalent on Ascend is PyPTO's grouped matmul kernel backed by the Cube engine.
  4. TP16 only, no DP/EP: avoids the inter-node RoCE AllToAll, but puts all 16 cards in one TP group — TP AllReduce across RoCE is even more frequent than EP AllToAll and likely worse. DP+EP is the correct topology for MoE on multi-node.
  5. Smaller MoE model first (e.g., Qwen3-MoE-57B): viable as a faster debug loop and still exercises the full EP AllToAll path. Can be added as a prerequisite if 671B iteration speed proves too slow in early phases.
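
The weight-memory figures behind alternative 2 follow from simple arithmetic. The sketch assumes 1 TB = 10¹² bytes and an even split of weights across all 16 cards, which ignores scales and any weights replicated across DP groups, so it is a lower bound rather than a deployment estimate.

```python
def weight_bytes(num_params, bytes_per_param):
    """Raw weight storage, ignoring scales, activations, and KV cache."""
    return num_params * bytes_per_param


PARAMS = 671e9
NUM_CARDS = 16

bf16_tb = weight_bytes(PARAMS, 2) / 1e12
int8_tb = weight_bytes(PARAMS, 1) / 1e12
int8_gb_per_card = weight_bytes(PARAMS, 1) / NUM_CARDS / 1e9

print(f"BF16 total: {bf16_tb:.2f} TB")            # BF16 total: 1.34 TB
print(f"INT8 total: {int8_tb:.2f} TB")            # INT8 total: 0.67 TB
print(f"INT8 per card: {int8_gb_per_card:.1f} GB")  # INT8 per card: 41.9 GB
```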

Risks

| Risk | Mitigation |
| --- | --- |
| HCCL stream exhaustion under graph-capture mode: combining TP+EP+DP collectives can exceed the available stream count | Disable graph capture during initial bring-up; profile stream usage before re-enabling |
| EP AllToAll latency over RoCE v2: inter-node AllToAll in the decode phase is latency-bound | Profile carefully; overlap AllToAll with shared-expert (non-EP) compute in Phase E |
| Grouped GEMM zero-batch handling: some EP ranks may receive zero tokens for certain experts; grouped GEMM must not crash or produce garbage | Unit test grouped GEMM with zero-batch experts before integration |
| W8A8 accuracy across all 61 layers: dynamic activation scaling error can accumulate | Per-layer cos_sim gate in Phase F; land W8A8 only after BF16 baseline passes |
| MLA RoPE variant: absorbed vs. non-absorbed RoPE affects whether RoPE is applied before or after KV compression; the wrong variant causes silent accuracy degradation | Validate against HF modeling_deepseek_v3.py; flag variant explicitly in rope.py |
| W8A8 weight loader memory spike: loading a BF16 checkpoint and then quantizing on the fly can transiently OOM | Convert to INT8 safetensors offline; load INT8 weights directly and EP-slice before loading |
| Inter-node process group initialization: HCCL multi-node setup requires correct RoCE NIC binding; misconfiguration leads to silent hangs | Test basic AllReduce across both nodes before any model code |

Additional Context

Related:

Follow-up:

  • [Feature] [Tracking] PagedAttention + multi-request serving for DeepSeek-V3.2 on Ascend NPU — to be filed after Phase G
