Summary
Track the end-to-end bring-up of DeepSeek-V3.2 distributed inference on 16× Ascend NPU cards using a DP2 + TP8 + EP16 parallelism strategy. The work extends the existing single-layer forward computation (both prefill and decode) into a complete distributed inference pipeline, covering HCCL-based inter-NPU communication, MLA-aware KV cache, W8A8 INT8 quantization, and a continuous-batching scheduler.
This issue covers Phase 1: functional correctness on a 16-card single-cluster setup with greedy decoding. Serving-scale optimizations (PagedAttention, speculative decoding, disaggregated prefill) are tracked separately.
Hardware & Software Environment
| Item | Detail |
| --- | --- |
| NPU | Ascend card × 16 (2 nodes × 8 cards per node) |
| Precision | No native FP8 support; primary inference precision is BF16 / W8A8 INT8 |
| Intra-node interconnect | HCCS (scale-up) |
| Inter-node interconnect | RoCE v2 RDMA (scale-out) |
| Communication library | HCCL (Huawei Collective Communication Library) |
| Kernel framework | PyPTO (this repo) |
Topology
Total NPUs: 16 (2 nodes × 8 cards each)
DP size: 2 (cards 0–7 = DP rank 0, cards 8–15 = DP rank 1)
TP size: 8 (one TP rank per card within each DP rank)
EP size: 16 (all cards; expert layers form a single AllToAll group)
Expert sharding:
  num_experts = 256, top-k = 2
  num_local_experts = 256 / 16 = 16 experts per card
Process groups (HCCL backend):
  tp_group: {0..7} and {8..15}
  dp_group: {0,8}, {1,9}, ..., {7,15}
  ep_group: {0..15}
Intra-node comms (TP AllReduce): HCCS; fast, low latency
Inter-node comms (EP AllToAll, DP coord): RoCE v2; higher latency, overlap strategy needed
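The rank layout above can be derived mechanically from the world size and the TP size. The following is a minimal sketch, not the repo's implementation: `build_groups` and `local_expert_ids` are hypothetical helper names, and the contiguous expert-to-rank assignment is an assumption that the topology does not spell out.

```python
def build_groups(world_size=16, tp_size=8):
    """Return (tp_groups, dp_groups, ep_group) as lists of global ranks."""
    assert world_size % tp_size == 0
    dp_size = world_size // tp_size
    # Consecutive ranks share a TP group, so each TP group stays on one node.
    tp_groups = [list(range(d * tp_size, (d + 1) * tp_size)) for d in range(dp_size)]
    # Ranks holding the same TP index in different DP ranks form a DP group.
    dp_groups = [[d * tp_size + t for d in range(dp_size)] for t in range(tp_size)]
    # Every card takes part in the single EP group.
    ep_group = list(range(world_size))
    return tp_groups, dp_groups, ep_group


def local_expert_ids(ep_rank, num_experts=256, ep_size=16):
    """Assumed contiguous sharding: rank r owns experts [r*16, (r+1)*16)."""
    per_rank = num_experts // ep_size
    return list(range(ep_rank * per_rank, (ep_rank + 1) * per_rank))


tp_groups, dp_groups, ep_group = build_groups()
```

Each rank list would then be handed to whatever group-creation call the HCCL binding exposes; that call is deliberately left out here.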
Motivation / Use Case
The repo currently only ships single-layer forward kernels for prefill and decode. There is no end-to-end distributed pipeline that can:
- Correctly route and shard a 671B MoE model across 16 Ascend NPU cards;
- Serve as a reference implementation for DP+TP+EP combined parallelism on the PyPTO stack;
- Demonstrate the full integration story: PyPTO kernel + HCCL communication + MLA KV cache + W8A8 quantization + scheduler + tokenizer.
DeepSeek-V3.2 is chosen as the first target because:
- Its MoE architecture (256 experts, top-2 routing) is naturally EP-friendly and well-suited to Ascend's inter-card communication profile — MoE requires less AllReduce traffic per layer than a dense model;
- DeepSeek uses MLA (Multi-head Latent Attention), which significantly reduces KV cache memory pressure;
- W8A8 INT8 is the natural precision target on Ascend, with hardware-level co-design enabling near-BF16 accuracy.
Proposed Components / Scope
A. Communication Layer (HCCL-based)
All collectives use the HCCL backend. Three distinct communication patterns are needed, each using a different process group:
- TP AllReduce: within each 8-card TP group, after the attention output projection and the MLP row-parallel linear; runs over intra-node HCCS
- EP AllToAll dispatch/combine: across all 16 cards for MoE layers:
  - Pre-communication of per-rank send counts so each rank knows its recv sizes before the data AllToAll
  - Variable-length AllToAllv: split tensors dispatched via the HCCL EP group
  - Token reordering/packing by destination rank before dispatch
  - EP AllToAll crosses the node boundary over RoCE v2, where latency is higher than intra-node HCCS; overlap with shared-expert compute is desirable
- DP synchronization coordinator: dummy forward pass injection for idle DP ranks; both DP groups must enter every HCCL collective together or all ranks hang
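The packing and send-count steps of the EP dispatch can be illustrated without any communication library. This sketch simulates them on plain lists; `pack_by_destination` and `exchange_counts` are hypothetical names, and the real count exchange would be a small fixed-size AllToAll over the EP group rather than a local transpose.

```python
def pack_by_destination(tokens, dest_ranks, ep_size):
    """Reorder tokens so rows bound for the same rank are contiguous.

    Returns (packed, send_counts, order); `order` maps packed positions back
    to original positions, which the combine step needs to undo the shuffle.
    """
    # sorted() is stable, so tokens keep their relative order within a rank.
    order = sorted(range(len(tokens)), key=lambda i: dest_ranks[i])
    packed = [tokens[i] for i in order]
    send_counts = [0] * ep_size
    for r in dest_ranks:
        send_counts[r] += 1
    return packed, send_counts, order


def exchange_counts(send_counts_per_rank):
    """Simulate the pre-exchange: recv[r][s] is what rank s sends to rank r."""
    n = len(send_counts_per_rank)
    return [[send_counts_per_rank[s][r] for s in range(n)] for r in range(n)]


packed, send_counts, order = pack_by_destination(["a", "b", "c", "d"], [2, 0, 2, 1], 4)
```

With the recv counts known up front, each rank can allocate its receive buffer once before the variable-length data exchange starts.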
Files to add under distributed/:
hccl_tp_comm.py — TP AllReduce helpers
hccl_ep_dispatch.py — AllToAllv with send-count pre-exchange
dp_coord.py — DP idle-rank coordinator
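The idle-rank rule handled by the DP coordinator fits in a few lines. This is a sketch under the assumption that every DP rank already knows how many sequences each rank has pending (in practice that would itself require a small collective); `plan_dp_step` is a hypothetical name.

```python
def plan_dp_step(pending_per_dp_rank):
    """Decide what each DP rank runs this step.

    "real"  : the rank has work and runs a normal forward pass
    "dummy" : the rank is idle but must still enter the shared collectives
    "skip"  : no rank has work, so nobody launches a collective
    """
    if not any(n > 0 for n in pending_per_dp_rank):
        return ["skip"] * len(pending_per_dp_rank)
    return ["real" if n > 0 else "dummy" for n in pending_per_dp_rank]


plan = plan_dp_step([3, 0])  # DP rank 1 is idle and gets a dummy pass
```

The important property is that the decision is identical on every rank, so no rank can block in a collective that its peers never enter.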
B. MoE Router + Expert Execution
- Router (gating): runs on all ranks; outputs [num_tokens, 256] logits, selects top-2 experts per token
- Token dispatch: reorder hidden states by destination EP rank, call hccl_ep_dispatch.dispatch()
- Grouped GEMM: per-card computation over the 16 locally owned experts with variable batch sizes, implemented via PyPTO grouped matmul kernels; must handle zero-token experts gracefully
- Result combine: hccl_ep_dispatch.combine(), weighted sum of top-2 expert outputs per token
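The router and combine steps can be sketched for a single token in plain Python. The weight normalization used here (softmax over the selected logits only) is an assumption for illustration; the issue does not state which gating function the model uses, and `route_top_k` and `combine` are hypothetical names.

```python
import math


def route_top_k(logits, k=2):
    """Return [(expert_id, weight), ...] for one token."""
    # Highest logits first; ties keep the lower expert id first.
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]


def combine(expert_outputs, routing):
    """Weighted sum of the selected experts' output vectors for one token."""
    dim = len(expert_outputs[routing[0][0]])
    out = [0.0] * dim
    for expert_id, weight in routing:
        for j in range(dim):
            out[j] += weight * expert_outputs[expert_id][j]
    return out


routing = route_top_k([0.0, 2.0, 1.0, 2.0], k=2)
mixed = combine({1: [2.0, 0.0], 3: [0.0, 4.0]}, routing)
```

In the distributed layer the expert outputs arrive through the combine AllToAll, so the weighted sum runs on the rank that originally owned the token.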
Files to add under models/deepseek/:
moe_router.py
moe_layer.py — full router + dispatch + grouped GEMM + combine
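The zero-token requirement on the grouped GEMM is easiest to pin down with a reference implementation that a unit test can compare the kernel against. This is a plain-Python sketch on nested lists, not the PyPTO kernel; `grouped_gemm` and its argument layout are assumptions.

```python
def matmul(a, b):
    """Plain (m x k) @ (k x n) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm(tokens, group_sizes, weights):
    """tokens are packed by expert; group_sizes[i] rows belong to expert i."""
    assert sum(group_sizes) == len(tokens)
    out, start = [], 0
    for size, weight in zip(group_sizes, weights):
        if size > 0:
            out.extend(matmul(tokens[start:start + size], weight))
        # An expert with zero tokens contributes no rows and is simply skipped.
        start += size
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
result = grouped_gemm([[1.0, 2.0], [3.0, 4.0]], [1, 0, 1], [identity, identity, double])
```

The middle expert receives no tokens, and the output still has exactly one row per input token in the original packed order.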
C. KV Cache (MLA-aware)
DeepSeek uses MLA with compressed KV latents, which significantly reduces per-card memory pressure:
- Cache the KV latent c_kv of shape [num_layers, max_seq, kv_lora_rank=512] per card, instead of full K/V tensors; roughly 32× smaller than a standard MHA cache
- Re-project latent → K, V at decode time on the fly using PyPTO matmul kernels
- TP sharding: KV latent replicated across the TP group (small enough to be practical)
- Attention kernel: use PyPTO prefill attention for prefill; a dedicated low-latency decode attention kernel for the decode path — the generic attention path has known poor small-batch decode performance on Ascend
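To size the latent cache, the shape above multiplies out directly. The sketch below uses the 61-layer count mentioned under Risks and a hypothetical `max_seq` of 4096; it covers only the c_kv tensor as described here, for one cached sequence.

```python
def latent_cache_bytes(num_layers, max_seq, kv_lora_rank=512, bytes_per_elem=2):
    """Bytes for one [num_layers, max_seq, kv_lora_rank] latent cache."""
    return num_layers * max_seq * kv_lora_rank * bytes_per_elem


bf16_bytes = latent_cache_bytes(61, 4096)                    # 255_852_544 bytes = 244 MiB
int8_bytes = latent_cache_bytes(61, 4096, bytes_per_elem=1)  # 127_926_272 bytes = 122 MiB
```

Because the latent is replicated across the TP group, this figure applies to every card rather than being divided by the TP size.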
Files to add under cache/:
mla_kv_cache.py — MLA latent allocation per card
rope.py — RoPE cos/sin table (theta=10000, YaRN optional)
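A cos/sin table for standard RoPE with theta=10000 can be precomputed once per run. This sketch covers only the unscaled variant; YaRN scaling and the absorbed versus non-absorbed question are not addressed, and `head_dim` stands for whatever rotary dimension the model config specifies.

```python
import math


def rope_tables(head_dim, max_seq, theta=10000.0):
    """Return (cos, sin), each of shape [max_seq][head_dim // 2]."""
    assert head_dim % 2 == 0
    # One inverse frequency per rotated pair of channels.
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    cos = [[math.cos(pos * f) for f in inv_freq] for pos in range(max_seq)]
    sin = [[math.sin(pos * f) for f in inv_freq] for pos in range(max_seq)]
    return cos, sin


cos, sin = rope_tables(head_dim=4, max_seq=3)
```

Position 0 always yields cos = 1 and sin = 0, which makes a convenient sanity check when validating against the reference implementation.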
D. Quantization: W8A8 INT8
Hardware note: Ascend cards do not support FP8 natively. The Cube compute engine is co-designed for W8A8 INT8, with a parallel Vector Unit performing dynamic activation scaling in real time to maintain near-BF16 accuracy. This replaces the FP8/DeepGEMM strategy used on NVIDIA GPUs.
- Weight quantization (W8): expert and attention weights stored as INT8 with per-channel scales; each EP rank loads only its 16 experts' INT8 weights at init time — never load all 256 and discard
- Activation quantization (A8): dynamic per-token INT8 quantization before each Cube GEMM; compute amax, scale, and quantize online via PyPTO Vector-unit kernels
- INT8 grouped GEMM: expert compute via PyPTO's INT8 grouped matmul kernel backed by the Cube engine
- KV cache quantization (optional): INT8 KV cache for MLA latent to further reduce HBM pressure; dequantize before attention
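The amax, scale, quantize sequence for dynamic per-token activation quantization looks like this in reference form. It is a sketch assuming symmetric quantization to the range [-127, 127]; the actual range and rounding mode of the Vector-unit kernels are not stated in this issue.

```python
def quantize_per_token(row, qmax=127):
    """Symmetric dynamic quantization of one token's activations."""
    amax = max(abs(v) for v in row)
    # Guard the all-zero row so the scale is never zero.
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to floating point."""
    return [v * scale for v in q]


q, scale = quantize_per_token([127.0, -64.0, 0.0, 12.7])
```

The round-trip error per element is bounded by half a quantization step (scale / 2), which is the quantity the per-layer cos_sim gate indirectly checks.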
Files to add under quantization/:
w8a8_linear.py — dynamic per-token INT8 activation quant + INT8 matmul
w8a8_grouped_gemm.py — INT8 grouped matmul wrapper for MoE experts
weight_loader.py — EP-sliced W8A8 checkpoint loader (INT8 safetensors format)
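The "load only your own experts" rule for the weight loader amounts to filtering checkpoint keys before any tensor is read. In this sketch the key pattern `.experts.<id>.` is a hypothetical naming scheme, contiguous expert ownership is assumed, and TP slicing of non-expert tensors is out of scope.

```python
import re

# Hypothetical key layout, e.g. "layers.3.experts.17.w1.weight".
EXPERT_KEY = re.compile(r"\.experts\.(\d+)\.")


def keys_for_rank(all_keys, ep_rank, num_experts=256, ep_size=16):
    """Select the checkpoint keys this EP rank should actually load."""
    per_rank = num_experts // ep_size
    lo, hi = ep_rank * per_rank, (ep_rank + 1) * per_rank
    keep = []
    for key in all_keys:
        match = EXPERT_KEY.search(key)
        # Non-expert tensors pass through; expert tensors only if owned.
        if match is None or lo <= int(match.group(1)) < hi:
            keep.append(key)
    return keep


keys = [
    "layers.3.attn.wq.weight",
    "layers.3.experts.0.w1.weight",
    "layers.3.experts.16.w1.weight",
    "layers.3.experts.31.w1.weight",
    "layers.3.experts.32.w1.weight",
]
rank1_keys = keys_for_rank(keys, ep_rank=1)
```

Filtering on keys first keeps peak host memory proportional to one rank's share, which is the point of the loader memory-spike mitigation listed under Risks.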
E. Scheduler + Generation Loop
- Maintains request queue with per-sequence KV cache slots and sequence lengths
- Implements continuous batching: mix prefill and decode sequences in the same forward pass
- Prefill path: large token batches, compute-bound on Cube INT8 GEMMs
- Decode path: 1 token per active sequence, HBM-bandwidth bound; EP AllToAll over RoCE v2 is the latency bottleneck — consider overlapping with shared-expert (non-EP) computation
- DP load balancer: distributes incoming requests evenly across DP2 ranks; idle DP ranks still incur AllToAll synchronization cost so imbalance is expensive
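Batch construction for continuous batching can be sketched as a token-budget fill. The policy shown (decodes first, then FIFO prefills that fit) and the `token_budget` parameter are assumptions for illustration, not a design decision recorded in this issue.

```python
def build_batch(decoding, waiting, token_budget):
    """Return [(seq_id, num_tokens)] for one forward pass.

    decoding: seq_ids that are already past prefill (1 token each)
    waiting:  [(seq_id, prompt_len)] in arrival order
    """
    batch, used = [], 0
    for seq_id in decoding:
        if used + 1 > token_budget:
            break
        batch.append((seq_id, 1))
        used += 1
    for seq_id, prompt_len in waiting:
        # FIFO admission: stop at the first prompt that does not fit.
        if used + prompt_len > token_budget:
            break
        batch.append((seq_id, prompt_len))
        used += prompt_len
    return batch


batch = build_batch(decoding=[1, 2], waiting=[(3, 5), (4, 10)], token_budget=8)
```

Serving decodes first bounds per-token latency for running sequences, at the cost of delaying large prompts when the budget is tight.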
Files to add under engine/:
scheduler.py — request queue, KV slot tracking, batch construction
generation.py — prefill → decode → sampling loop; tokenizer + chat template
dp_load_balancer.py — DP2 request distribution
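One simple way to spread requests evenly across the two DP ranks is greedy least-loaded assignment. Using prompt length as the load metric is an assumption; `assign_requests` is a hypothetical name.

```python
def assign_requests(request_lens, dp_size=2):
    """Greedy least-loaded assignment; returns (assignment, load_per_rank)."""
    load = [0] * dp_size
    assignment = []
    for n_tokens in request_lens:
        # Ties go to the lowest DP rank.
        rank = min(range(dp_size), key=lambda i: load[i])
        assignment.append(rank)
        load[rank] += n_tokens
    return assignment, load


assignment, load = assign_requests([10, 3, 4, 2])
```

Here the first long request lands on DP rank 0 and the three short ones on DP rank 1, leaving the loads at 10 and 9 tokens.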
Implementation Plan
| Phase | Task | Est. effort |
| --- | --- | --- |
| A. Communication | HCCL TP AllReduce, EP AllToAllv with send-count pre-exchange, DP sync coordinator, stream budget analysis | 3–4 days |
| B. MoE Layer | Router, token dispatch/combine, grouped GEMM (BF16 baseline first) | 2–3 days |
| C. KV Cache | MLA latent cache, decode attention kernel integration, RoPE tables | 2 days |
| D. W8A8 Quantization | INT8 weight/activation quant, INT8 grouped GEMM, EP-sliced W8A8 weight loader | 3–4 days |
| E. Scheduler | Continuous batching, prefill/decode paths, DP load balancing, RoCE latency overlap | 2–3 days |
| F. Accuracy Validation | Per-layer cos_sim ≥ 0.999 vs HF reference; greedy 20-step token match | 2 days |
| G. E2E Demo | Full prompt → text generation on 16 NPUs, readable output, throughput log | (A–F done) |
Milestone: ~2.5–3 weeks to first functional end-to-end greedy decoding demo.
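The two Phase F acceptance checks can be written as small pure functions. This is a sketch that assumes each layer output has been flattened to a list of floats with nonzero norm; `passes_gate` and `greedy_match` are hypothetical names.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_gate(layer_outputs, ref_outputs, threshold=0.999):
    """True only if every layer meets the cos_sim threshold."""
    return all(cos_sim(a, b) >= threshold for a, b in zip(layer_outputs, ref_outputs))


def greedy_match(tokens, ref_tokens, steps=20):
    """True if the first `steps` greedy tokens agree exactly."""
    return tokens[:steps] == ref_tokens[:steps]


ok = passes_gate([[1.0, 2.0]], [[1.0, 2.0001]])
```

Checking layer by layer rather than only at the logits makes it possible to locate the first layer where quantization error exceeds the gate.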
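Sub-tasks
- examples/models/deepseek_671b/e2e_guide_ascend.md
- distributed/hccl_tp_comm.py
- distributed/hccl_ep_dispatch.py: AllToAllv with send-count pre-exchange
- distributed/dp_coord.py
- models/deepseek/moe_router.py
- models/deepseek/moe_layer.py: BF16 baseline
- cache/mla_kv_cache.py
- cache/rope.py
- quantization/w8a8_linear.py
- quantization/w8a8_grouped_gemm.py
- quantization/weight_loader.py
- engine/scheduler.py
- engine/generation.py
- engine/dp_load_balancer.py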
Alternatives Considered
- FP8 quantization (as used on H100/H200): not applicable — Ascend cards have no native FP8 hardware support. W8A8 INT8 is the correct and hardware-optimized precision target.
- BF16 throughout, skip quantization: 671B in BF16 requires ~1.3 TB of card memory — far exceeds the 16-card ceiling and leaves no room for activations or KV cache. W8A8 halves weight memory, making the deployment feasible.
- Use DeepGEMM or CUTLASS for grouped GEMM: both are CUDA-only. The equivalent on Ascend is PyPTO's grouped matmul kernel backed by the Cube engine.
- TP16 only, no DP/EP: avoids the inter-node RoCE AllToAll, but puts all 16 cards in one TP group — TP AllReduce across RoCE is even more frequent than EP AllToAll and likely worse. DP+EP is the correct topology for MoE on multi-node.
- Smaller MoE model first (e.g., Qwen3-MoE-57B): viable as a faster debug loop and still exercises the full EP AllToAll path. Can be added as a prerequisite if 671B iteration speed proves too slow in early phases.
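The memory argument against BF16 is simple arithmetic on the parameter count. The sketch below counts weights only and ignores quantization scales, activations, and KV cache; the per-card figure assumes an even split across 16 cards, which is a simplification because some weights are replicated.

```python
def weight_bytes(num_params, bytes_per_param):
    """Raw weight storage, ignoring scales and any padding."""
    return num_params * bytes_per_param


PARAMS = 671_000_000_000

bf16_tb = weight_bytes(PARAMS, 2) / 1e12                # 1.342 TB
int8_tb = weight_bytes(PARAMS, 1) / 1e12                # 0.671 TB
int8_gb_per_card = weight_bytes(PARAMS, 1) / 16 / 1e9   # 41.9375 GB
```

Going from 2 bytes to 1 byte per parameter is exactly the halving that the W8A8 alternative relies on.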
Risks
| Risk | Mitigation |
| --- | --- |
| HCCL stream exhaustion under graph-capture mode: combining TP+EP+DP collectives can exceed the available stream count | Disable graph capture during initial bring-up; profile stream usage before re-enabling |
| EP AllToAll latency over RoCE v2: inter-node AllToAll in the decode phase is latency-bound | Profile carefully; overlap AllToAll with shared-expert (non-EP) compute in Phase E |
| Grouped GEMM zero-batch handling: some EP ranks may receive zero tokens for certain experts; grouped GEMM must not crash or produce garbage | Unit test grouped GEMM with zero-batch experts before integration |
| W8A8 accuracy across 61 MoE layers: dynamic activation scaling error can accumulate | Per-layer cos_sim gate in Phase F; land W8A8 only after the BF16 baseline passes |
| MLA RoPE variant: absorbed vs. non-absorbed RoPE affects whether RoPE is applied before or after KV compression; the wrong variant causes silent accuracy degradation | Validate against HF modeling_deepseek_v3.py; flag the variant explicitly in rope.py |
| W8A8 weight loader memory spike: loading a BF16 checkpoint and then quantizing on the fly can transiently OOM | Convert to INT8 safetensors offline; load INT8 weights directly and EP-slice before loading |
| Inter-node process group initialization: HCCL multi-node setup requires correct RoCE NIC binding; misconfiguration leads to silent hangs | Test a basic AllReduce across both nodes before any model code |
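Additional Context
HF reference: transformers/models/deepseek_v3/modeling_deepseek_v3.py
Related:
- [existing single-layer decode kernel PR/issue]
- [existing single-layer prefill kernel PR/issue]
Follow-up:
- [Feature] [Tracking] PagedAttention + multi-request serving for DeepSeek-V3.2 on Ascend NPU (to be filed after Phase G)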