[Feature] [Tracking] End-to-end bring-up of DeepSeek-V3.2 distributed inference with DP2+TP8+EP

### Summary

Track the end-to-end bring-up of **DeepSeek-V3.2** distributed inference on **16× Ascend NPU cards** using a **DP2 + TP8 + EP16** parallelism strategy. The work extends the existing single-layer forward computation (both prefill and decode) into a complete distributed inference pipeline, covering HCCL-based inter-NPU communication, MLA-aware KV cache, W8A8 INT8 quantization, and a continuous-batching scheduler.

This issue covers **Phase 1: functional correctness on a 16-card single-cluster setup with greedy decoding**. Serving-scale optimizations (PagedAttention, speculative decoding, disaggregated prefill) are tracked separately.

---

### Hardware & Software Environment

| Item | Detail |
|------|--------|
| **NPU** | Ascend card × 16 (2 nodes × 8 cards per node) |
| **Precision** | **No native FP8 support**; primary inference precision is **BF16 / W8A8 INT8** |
| **Intra-node interconnect** | HCCS (scale-up) |
| **Inter-node interconnect** | RoCE v2 RDMA (scale-out) |
| **Communication library** | HCCL (Huawei Collective Communication Library) |
| **Kernel framework** | PyPTO (this repo) |

---

### Topology

```
Total NPUs: 16 (2 nodes × 8 cards each)

  DP size:     2  (cards 0–7 = DP rank 0,  cards 8–15 = DP rank 1)
  TP size:     8  (one TP rank per card within each DP replica)
  EP size:    16  (all cards; expert layers form a single AllToAll group)

Expert sharding:
  num_experts = 256 (routed),  top-k = 8
  num_local_experts = 256 / 16 = 16 experts per card

Process groups (HCCL backend):
  tp_group:  {0..7}  and  {8..15}
  dp_group:  {0,8}, {1,9}, ..., {7,15}
  ep_group:  {0..15}

Intra-node comms (TP AllReduce):  HCCS — fast, low latency
Inter-node comms (EP AllToAll, DP coord):  RoCE v2 — higher latency;
                                            overlap strategy needed
```
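
The rank lists above can be derived mechanically from the world size and TP size. The following is a minimal sketch in plain Python; `build_groups` is a hypothetical helper name, and how the lists are handed to HCCL depends on the runtime in use and is not shown here.

```python
from typing import List, Tuple


def build_groups(world_size: int = 16, tp_size: int = 8
                 ) -> Tuple[List[List[int]], List[List[int]], List[int]]:
    """Return rank lists for the TP, DP, and EP groups of a DP x TP layout."""
    assert world_size % tp_size == 0
    dp_size = world_size // tp_size
    # Consecutive ranks share a TP group, so each TP group stays inside one node.
    tp_groups = [list(range(d * tp_size, (d + 1) * tp_size)) for d in range(dp_size)]
    # Ranks with the same TP index across DP replicas form a DP group.
    dp_groups = [list(range(t, world_size, tp_size)) for t in range(tp_size)]
    # Expert parallelism spans every rank.
    ep_group = list(range(world_size))
    return tp_groups, dp_groups, ep_group


tp_groups, dp_groups, ep_group = build_groups()
print(tp_groups)      # two lists of 8 ranks
print(dp_groups[:2])  # [[0, 8], [1, 9]]
```

Keeping each TP group on consecutive ranks is what keeps TP AllReduce on HCCS; only the DP and EP groups cross the node boundary.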

---

### Motivation / Use Case

The repo currently only ships **single-layer forward kernels** for prefill and decode. There is no end-to-end distributed pipeline that can:

1. Correctly route and shard a 671B MoE model across 16 Ascend NPU cards;
2. Serve as a reference implementation for DP+TP+EP combined parallelism on the PyPTO stack;
3. Demonstrate the full integration story: PyPTO kernel + HCCL communication + MLA KV cache + W8A8 quantization + scheduler + tokenizer.

DeepSeek-V3.2 is chosen as the first target because:
- Its MoE architecture (256 routed experts, top-8 routing) is naturally EP-friendly and well-suited to Ascend's inter-card communication profile, since MoE requires less AllReduce traffic per layer than a dense model;
- DeepSeek uses MLA (Multi-head Latent Attention), which significantly reduces KV cache memory pressure;
- W8A8 INT8 is the natural precision target on Ascend, with hardware-level co-design enabling near-BF16 accuracy.

---

### Proposed Components / Scope

#### A. Communication Layer (HCCL-based)

All collectives use the **HCCL backend**. Three distinct communication patterns are needed, each using a different process group:

- **TP AllReduce** — within each 8-card TP group, after attention output projection and MLP row-parallel linear; runs over HCCS intra-node
- **EP AllToAll dispatch/combine** — across all 16 cards for MoE layers:
  - Pre-communication of per-rank send counts so each rank knows recv sizes before the data AllToAll
  - Variable-length AllToAllv: split tensors dispatched via HCCL EP group
  - Token reordering/packing by destination rank before dispatch
  - EP AllToAll crosses the node boundary over RoCE v2 — latency is higher than intra-node HCCS; overlap with shared-expert compute is desirable
- **DP synchronization coordinator** — dummy forward pass injection for idle DP ranks; both DP groups must enter every HCCL collective together or all ranks hang

Files to add under `distributed/`:
- `hccl_tp_comm.py` — TP AllReduce helpers
- `hccl_ep_dispatch.py` — AllToAllv with send-count pre-exchange
- `dp_coord.py` — DP idle-rank coordinator
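
The send-count pre-exchange and destination packing described above can be sketched without any communication library. This is an illustrative plain-Python model only: `pack_by_destination` and `recv_counts_from_all` are hypothetical names, experts are assumed to be assigned to EP ranks in contiguous blocks of 16, and the real exchange would run over the HCCL EP group.

```python
from typing import List, Tuple


def pack_by_destination(
    token_ids: List[int],
    expert_ids: List[List[int]],
    num_experts: int = 256,
    ep_size: int = 16,
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Compute per-rank send counts and a dispatch order grouped by destination rank.

    token_ids[i] is a token handle; expert_ids[i] lists the experts chosen for it.
    Returns (send_counts, packed), where packed holds (token_id, expert_id) pairs
    ordered so that all pairs bound for rank 0 come first, then rank 1, and so on.
    """
    experts_per_rank = num_experts // ep_size
    pairs = [(tok, e) for tok, chosen in zip(token_ids, expert_ids) for e in chosen]
    # sorted() is stable, so token order is preserved within each destination rank.
    packed = sorted(pairs, key=lambda p: p[1] // experts_per_rank)
    send_counts = [0] * ep_size
    for _, e in packed:
        send_counts[e // experts_per_rank] += 1
    return send_counts, packed


def recv_counts_from_all(send_counts_per_rank: List[List[int]], my_rank: int) -> List[int]:
    """Model the count pre-exchange: rank r receives column r of the send-count matrix."""
    return [row[my_rank] for row in send_counts_per_rank]


send_counts, packed = pack_by_destination(
    token_ids=[0, 1, 2],
    expert_ids=[[0, 17], [16, 255], [0, 16]],
)
print(send_counts[:2], send_counts[15])  # [2, 3] 1
```

With the receive counts known up front, each rank can size its receive buffer before the variable-length data exchange starts.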

#### B. MoE Router + Expert Execution

- **Router (gating)** — runs on all ranks; outputs `[num_tokens, 256]` logits and selects the top-8 experts per token
- **Token dispatch** — reorder hidden states by destination EP rank, call `hccl_ep_dispatch.dispatch()`
- **Grouped GEMM** — per-card computation over locally owned 16 experts with variable batch sizes, implemented via PyPTO grouped matmul kernels; must handle zero-token experts gracefully
- **Result combine** — `hccl_ep_dispatch.combine()`, weighted sum of the top-8 expert outputs per token
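
The router selection and the final weighted combine can be illustrated for a single token as follows. This is a sketch under assumptions: scores are taken to be positive gate values, the selected weights are renormalized to sum to 1, and `route_top_k` and `combine` are hypothetical helper names rather than the planned `moe_router.py` API.

```python
import heapq
from typing import Dict, List, Tuple


def route_top_k(scores: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    top = heapq.nlargest(k, enumerate(scores), key=lambda p: p[1])
    total = sum(s for _, s in top)
    return [(idx, s / total) for idx, s in top]


def combine(expert_outputs: Dict[int, List[float]],
            routing: List[Tuple[int, float]]) -> List[float]:
    """Weighted sum of the selected experts' output vectors for one token."""
    dim = len(next(iter(expert_outputs.values())))
    out = [0.0] * dim
    for idx, weight in routing:
        for d in range(dim):
            out[d] += weight * expert_outputs[idx][d]
    return out


# Toy example with 4 experts and k=2; the real layer uses 256 experts.
routing = route_top_k([0.1, 0.6, 0.2, 0.3], k=2)
token_out = combine({1: [1.0, 0.0], 3: [0.0, 3.0]}, routing)
```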

Files to add under `models/deepseek/`:
- `moe_router.py`
- `moe_layer.py` — full router + dispatch + grouped GEMM + combine
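
The zero-token requirement for grouped GEMM is easiest to see in a reference implementation. The sketch below is a plain-Python stand-in for the PyPTO grouped matmul kernel, whose API is not shown in this issue; `grouped_matmul` and its argument layout are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def grouped_matmul(x_rows: Matrix, group_sizes: List[int], weights: List[Matrix]) -> Matrix:
    """Multiply each contiguous group of rows by its own expert weight matrix.

    x_rows holds tokens already sorted by local expert; group_sizes[e] is the
    number of tokens routed to local expert e; weights[e] is an [in][out] matrix.
    """
    out: Matrix = []
    start = 0
    for e, n in enumerate(group_sizes):
        if n == 0:
            continue  # an expert with no tokens contributes no output rows
        w = weights[e]
        for row in x_rows[start:start + n]:
            out.append([sum(row[i] * w[i][j] for i in range(len(row)))
                        for j in range(len(w[0]))])
        start += n
    assert start == len(x_rows), "group_sizes must cover every input row"
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
unused = [[9.0, 9.0], [9.0, 9.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
result = grouped_matmul([[1.0, 2.0], [3.0, 4.0]], [1, 0, 1], [identity, unused, double])
print(result)  # [[1.0, 2.0], [6.0, 8.0]]
```

A test of this shape, with at least one zero entry in `group_sizes`, is a reasonable reference to compare the real kernel against.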

#### C. KV Cache (MLA-aware)

DeepSeek uses MLA with compressed KV latents, which significantly reduces per-card memory pressure:

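- Cache the **KV latent** `c_kv` of shape `[num_layers, max_seq, kv_lora_rank=512]` per card, instead of full per-head K/V tensors. Assuming an MHA baseline of 128 heads with 128-dim K and V, a full cache would hold 2 × 128 × 128 = 32,768 values per token per layer, so the latent is roughly 64× smaller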
- Re-project latent → K, V at decode time on the fly using PyPTO matmul kernels
- TP sharding: KV latent replicated across the TP group (small enough to be practical)
- Attention kernel: use PyPTO prefill attention for prefill; a dedicated low-latency decode attention kernel for the decode path — the generic attention path has known poor small-batch decode performance on Ascend
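
A quick size estimate follows directly from the cache shape above. The sketch assumes 61 layers (the count used elsewhere in this issue), BF16 storage, and a hypothetical `max_seq` of 32,768. Any extra per-token state, such as a separately cached RoPE key, would add to this figure.

```python
def mla_latent_cache_bytes(num_layers: int, max_seq: int,
                           kv_lora_rank: int = 512, bytes_per_elem: int = 2) -> int:
    """Bytes for one latent cache of shape [num_layers, max_seq, kv_lora_rank]."""
    return num_layers * max_seq * kv_lora_rank * bytes_per_elem


size = mla_latent_cache_bytes(num_layers=61, max_seq=32768)
print(f"{size / 2**30:.2f} GiB")  # 1.91 GiB
```

Because the latent is replicated across the TP group, this cost is paid on every card in the group rather than divided by the TP size.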

Files to add under `cache/`:
- `mla_kv_cache.py` — MLA latent allocation per card
- `rope.py` — RoPE cos/sin table (`theta=10000`, YaRN optional)
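
A minimal sketch of the cos/sin table that `rope.py` would hold, using the standard RoPE frequency formula with `theta=10000`. YaRN scaling is not included, and the function name and the rotary dimension of 64 in the example are illustrative assumptions.

```python
import math
from typing import List, Tuple


def build_rope_table(max_seq: int, rot_dim: int, theta: float = 10000.0
                     ) -> Tuple[List[List[float]], List[List[float]]]:
    """Return cos and sin tables, each of shape [max_seq, rot_dim // 2]."""
    assert rot_dim % 2 == 0, "rotary dimension must be even"
    # One inverse frequency per rotated pair of channels.
    inv_freq = [theta ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    cos = [[math.cos(pos * f) for f in inv_freq] for pos in range(max_seq)]
    sin = [[math.sin(pos * f) for f in inv_freq] for pos in range(max_seq)]
    return cos, sin


cos_table, sin_table = build_rope_table(max_seq=8, rot_dim=64)
```

Precomputing the table once per process avoids recomputing trigonometric functions on the decode path.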

#### D. Quantization: W8A8 INT8

> **Hardware note**: Ascend cards do not support FP8 natively. The Cube compute engine is co-designed for **W8A8 INT8**, with a parallel Vector Unit performing dynamic activation scaling in real time to maintain near-BF16 accuracy. This replaces the FP8/DeepGEMM strategy used on NVIDIA GPUs.

- **Weight quantization (W8)**: expert and attention weights stored as INT8 with per-channel scales; each EP rank loads only its 16 experts' INT8 weights at init time — never load all 256 and discard
- **Activation quantization (A8)**: dynamic per-token INT8 quantization before each Cube GEMM; compute `amax`, scale, quantize online via PyPTO Vector-unit kernels
- **INT8 grouped GEMM**: expert compute via PyPTO's INT8 grouped matmul kernel backed by the Cube engine
- **KV cache quantization (optional)**: INT8 KV cache for MLA latent to further reduce HBM pressure; dequantize before attention

Files to add under `quantization/`:
- `w8a8_linear.py` — dynamic per-token INT8 activation quant + INT8 matmul
- `w8a8_grouped_gemm.py` — INT8 grouped matmul wrapper for MoE experts
- `weight_loader.py` — EP-sliced W8A8 checkpoint loader (INT8 safetensors format)
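
The dynamic per-token activation quantization step (compute `amax`, derive a scale, quantize) can be written out as a reference in plain Python. This is a sketch of symmetric INT8 quantization, not the PyPTO Vector-unit kernel; the function names and the choice to clamp to [-127, 127] are assumptions.

```python
from typing import List, Tuple


def quantize_per_token(x: List[float]) -> Tuple[List[int], float]:
    """Symmetric INT8 quantization of one token's activation vector."""
    amax = max(abs(v) for v in x)
    # Fall back to a scale of 1.0 for all-zero rows to avoid dividing by zero.
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map INT8 values back to floating point using the per-token scale."""
    return [v * scale for v in q]


q, scale = quantize_per_token([127.0, -63.0, 0.0])
print(q, scale)  # [127, -63, 0] 1.0
```

A reference like this is useful for the W8A8 versus BF16 comparison: the round-trip error per element should stay within half a quantization step.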

#### E. Scheduler + Generation Loop

- Maintains request queue with per-sequence KV cache slots and sequence lengths
- Implements **continuous batching**: mix prefill and decode sequences in the same forward pass
- **Prefill path**: large token batches, compute-bound on Cube INT8 GEMMs
- **Decode path**: 1 token per active sequence, HBM-bandwidth bound; EP AllToAll over RoCE v2 is the latency bottleneck — consider overlapping with shared-expert (non-EP) computation
- **DP load balancer**: distributes incoming requests evenly across the two DP ranks; an idle DP rank still incurs the AllToAll synchronization cost, so imbalance is expensive

Files to add under `engine/`:
- `scheduler.py` — request queue, KV slot tracking, batch construction
- `generation.py` — prefill → decode → sampling loop; tokenizer + chat template
- `dp_load_balancer.py` — DP2 request distribution
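
One simple policy for the DP load balancer is to send each request to the rank with the fewest outstanding tokens, and to flag idle ranks that need a dummy forward pass. The class below is a sketch of that policy only; its name, methods, and the use of token count as the load metric are assumptions, not the planned `dp_load_balancer.py` interface.

```python
from typing import Dict, List


class DPLoadBalancer:
    """Assign each request to the DP rank with the fewest outstanding tokens."""

    def __init__(self, dp_size: int = 2) -> None:
        self.load: List[int] = [0] * dp_size
        self.owner: Dict[str, int] = {}
        self.cost: Dict[str, int] = {}

    def assign(self, request_id: str, num_tokens: int) -> int:
        """Pick the least-loaded rank (lowest index wins ties) and record the request."""
        rank = min(range(len(self.load)), key=lambda r: self.load[r])
        self.load[rank] += num_tokens
        self.owner[request_id] = rank
        self.cost[request_id] = num_tokens
        return rank

    def finish(self, request_id: str) -> None:
        """Release the load held by a completed request."""
        rank = self.owner.pop(request_id)
        self.load[rank] -= self.cost.pop(request_id)

    def needs_dummy_forward(self) -> List[bool]:
        """True for ranks that are idle while some other rank still has work."""
        any_busy = any(n > 0 for n in self.load)
        return [any_busy and n == 0 for n in self.load]


lb = DPLoadBalancer()
ranks = [lb.assign("a", 100), lb.assign("b", 10), lb.assign("c", 5)]
print(ranks)  # [0, 1, 1]
```

The `needs_dummy_forward` flag reflects the constraint that every rank must enter each collective, even when it has no real tokens to process.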

---

### Implementation Plan

| Phase | Task | Est. effort |
|-------|------|-------------|
| **A. Communication** | HCCL TP AllReduce, EP AllToAllv with send-count pre-exchange, DP sync coordinator, stream budget analysis | 3–4 days |
| **B. MoE Layer** | Router, token dispatch/combine, grouped GEMM (BF16 baseline first) | 2–3 days |
| **C. KV Cache** | MLA latent cache, decode attention kernel integration, RoPE tables | 2 days |
| **D. W8A8 Quantization** | INT8 weight/activation quant, INT8 grouped GEMM, EP-sliced W8A8 weight loader | 3–4 days |
| **E. Scheduler** | Continuous batching, prefill/decode paths, DP load balancing, RoCE latency overlap | 2–3 days |
| **F. Accuracy Validation** | Per-layer cos_sim ≥ 0.999 vs HF reference; greedy 20-step token match | 2 days |
| **G. E2E Demo** | Full prompt → text generation on 16 NPUs, readable output, throughput log | (A–F done) |

**Milestone**: ~2.5–3 weeks to first functional end-to-end greedy decoding demo.

---

### Sub-tasks

- [ ] Land design doc `examples/models/deepseek_671b/e2e_guide_ascend.md`
- **Phase A — Communication**
  - [ ] `distributed/hccl_tp_comm.py`
  - [ ] `distributed/hccl_ep_dispatch.py` — AllToAllv with send-count pre-exchange
  - [ ] `distributed/dp_coord.py`
  - [ ] HCCL stream budget analysis for TP=8 + EP=16 combined
  - [ ] Unit test: AllToAll correctness on 16-card dummy tensors; validate RoCE v2 inter-node path
- **Phase B — MoE Layer**
  - [ ] `models/deepseek/moe_router.py`
  - [ ] `models/deepseek/moe_layer.py` — BF16 baseline
  - [ ] Single MoE layer accuracy vs HF reference (cos_sim ≥ 0.999)
- **Phase C — KV Cache**
  - [ ] `cache/mla_kv_cache.py`
  - [ ] `cache/rope.py`
  - [ ] Validate MLA re-projection matches HF per-step output
- **Phase D — W8A8 Quantization**
  - [ ] `quantization/w8a8_linear.py`
  - [ ] `quantization/w8a8_grouped_gemm.py`
  - [ ] `quantization/weight_loader.py`
  - [ ] W8A8 vs BF16 accuracy comparison (cos_sim ≥ 0.99 per expert layer)
- **Phase E — Scheduler**
  - [ ] `engine/scheduler.py`
  - [ ] `engine/generation.py`
  - [ ] `engine/dp_load_balancer.py`
  - [ ] Profile EP AllToAll latency over RoCE; assess overlap with shared expert compute
- **Phase F — Validation**
  - [ ] Per-layer cos_sim ≥ 0.999 vs HF (sampled layers 0, 10, 20, 30, 60)
  - [ ] Final hidden cos_sim ≥ 0.98
  - [ ] Greedy 20-step token ID match with HF transformers
- **Phase G — Demo**
  - [ ] End-to-end Chinese/English Q&A on 16 NPUs with readable output
  - [ ] Prefill and decode throughput numbers logged (tokens/s per card)
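
The two validation gates used in Phase F, cosine similarity and greedy token match, reduce to a few lines of reference code. This is a plain-Python sketch with hypothetical helper names; a real harness would operate on tensors dumped from both implementations.

```python
import math
from typing import List, Sequence


def cos_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened activations of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def first_mismatch(ours: List[int], ref: List[int]) -> int:
    """Index of the first differing token ID, or -1 if the sequences are identical."""
    for i, (x, y) in enumerate(zip(ours, ref)):
        if x != y:
            return i
    # Equal prefix: a length difference counts as a mismatch at the shorter length.
    return -1 if len(ours) == len(ref) else min(len(ours), len(ref))


assert cos_sim([1.0, 2.0], [1.0, 2.0]) >= 0.999
print(first_mismatch([5, 7, 9], [5, 8, 9]))  # 1
```

Reporting the index of the first mismatch, rather than a pass/fail flag, makes it easier to tell an early divergence from a late one caused by accumulated numerical error.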

---

### Alternatives Considered

1. **FP8 quantization (as used on H100/H200)**: not applicable — Ascend cards have no native FP8 hardware support. W8A8 INT8 is the correct and hardware-optimized precision target.
2. **BF16 throughout, skip quantization**: 671B parameters in BF16 need about 1.34 TB for weights alone (671 × 10⁹ × 2 bytes), which far exceeds the 16-card memory ceiling and leaves no room for activations or KV cache. W8A8 roughly halves weight memory, making the deployment feasible.
3. **Use DeepGEMM or CUTLASS for grouped GEMM**: both are CUDA-only. The equivalent on Ascend is PyPTO's grouped matmul kernel backed by the Cube engine.
4. **TP16 only, no DP/EP**: avoids the inter-node RoCE AllToAll, but puts all 16 cards in one TP group — TP AllReduce across RoCE is even more frequent than EP AllToAll and likely worse. DP+EP is the correct topology for MoE on multi-node.
5. **Smaller MoE model first (e.g., Qwen3-MoE-57B)**: viable as a faster debug loop and still exercises the full EP AllToAll path. Can be added as a prerequisite if 671B iteration speed proves too slow in early phases.

---

### Risks

| Risk | Mitigation |
|------|------------|
| **HCCL stream exhaustion under graph-capture mode** — combining TP+EP+DP collectives can exceed available stream count | Disable graph capture during initial bringup; profile stream usage before re-enabling |
| **EP AllToAll latency over RoCE v2** — inter-node AllToAll in decode phase is latency-bound | Profile carefully; overlap AllToAll with shared-expert (non-EP) compute in Phase E |
| **Grouped GEMM zero-batch handling** — some EP ranks may receive zero tokens for certain experts; grouped GEMM must not crash or produce garbage | Unit test grouped GEMM with zero-batch experts before integration |
| **W8A8 accuracy across all 61 layers (58 of them MoE)** — dynamic activation scaling error can accumulate | Per-layer cos_sim gate in Phase F; land W8A8 only after BF16 baseline passes |
| **MLA RoPE variant** — absorbed vs. non-absorbed RoPE affects whether RoPE is applied before or after KV compression; wrong variant causes silent accuracy degradation | Validate against HF `modeling_deepseek_v3.py`; flag variant explicitly in `rope.py` |
| **W8A8 weight loader memory spike** — loading BF16 checkpoint then quantizing on-the-fly can transiently OOM | Convert to INT8 safetensors offline; load INT8 weights directly and EP-slice before loading |
| **Inter-node process group initialization** — HCCL multi-node setup requires correct RoCE NIC binding; misconfiguration leads to silent hangs | Test basic AllReduce across both nodes before any model code |

---

### Additional Context

- Starting point: existing single-layer prefill/decode PyPTO kernels
- HF reference: `transformers/models/deepseek_v3/modeling_deepseek_v3.py`
- DeepSeek-V3 technical report: https://arxiv.org/abs/2412.19437
- CloudMatrix-Infer (production DeepSeek-R1 serving on Ascend, key reference for W8A8 MoE execution): https://arxiv.org/abs/2506.12708
- vLLM-Ascend deployment scheme: https://docs.vllm.ai/projects/ascend/en/latest/tutorials/models/DeepSeek-V3.2.html

Related:
- #126 #123 #150 #134 
- `[existing single-layer decode kernel PR/issue]`
- `[existing single-layer prefill kernel PR/issue]`

Follow-up:
- `[Feature] [Tracking] PagedAttention + multi-request serving for DeepSeek-V3.2 on Ascend NPU` — to be filed after Phase G
