diff --git a/2025-Ascend-Innovation-Contest/S1/MoE/HUST_ASCEND/README.md b/2025-Ascend-Innovation-Contest/S1/MoE/HUST_ASCEND/README.md
new file mode 100644
index 00000000..3d940bf6
--- /dev/null
+++ b/2025-Ascend-Innovation-Contest/S1/MoE/HUST_ASCEND/README.md
@@ -0,0 +1,134 @@
+# MindNLP Model Optimization Details (DeepseekMoE & Qwen2-MoE)
+
+## Evaluation Results
+
+| Metric | Average Score |
+|---------|---------|
+| Peak memory score | 100 |
+| Prefill latency score | 109.3774 |
+| Decode latency score | 360.0786 |
+| **Total** | **189.8187** |
+
+## Optimized Models
+
+This project adapts and performance-tunes the following two MoE (Mixture of Experts) models for the Ascend NPU:
+
+1. **DeepSeek-MoE-16B-Chat** - the open-source MoE large model from DeepSeek
+2. **Qwen1.5-MoE-A2.7B-Chat** - the open-source MoE large model from Tongyi Qianwen (Qwen)
+
+---
+
+## Core Optimization Techniques
+
+### 1. MindSpore Operator Adaptation
+
+#### 1.1 ops.split instead of slicing
+
+```python
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    # Before:
+    # x1 = x[..., : x.shape[-1] // 2]
+    # x2 = x[..., x.shape[-1] // 2 :]
+
+    # After: one fused split instead of two slice reads
+    x1, x2 = ops.split(x, x.shape[-1] // 2, dim=-1)
+    return ops.cat((-x2, x1), dim=-1)
+```
+
+#### 1.2 ops.narrow instead of slicing
+
+```python
+def forward(self, x, seq_len=None):
+    # x: [bs, num_attention_heads, seq_len, head_size]
+    if seq_len > self.max_seq_len_cached:
+        self._set_cos_sin_cache(seq_len=seq_len, dtype=x.dtype)
+
+    return (
+        # self.cos_cached[:seq_len].to(dtype=x.dtype),
+        # self.sin_cached[:seq_len].to(dtype=x.dtype),
+        ops.narrow(self.cos_cached, 0, 0, seq_len).to(dtype=x.dtype),
+        ops.narrow(self.sin_cached, 0, 0, seq_len).to(dtype=x.dtype),
+    )
+```
+
+**Benefit**: `ops.narrow` returns a view of the cached cos/sin tables, avoiding the extra memory copy that slicing incurs.
+
+---
+
+### 2. FlashAttention Optimization
+
+```python
+else:  # prefill path: use MindSpore's fused flash_attention_score operator
+    sparse_mode = 0
+    attn_mask = None
+    if attention_mask is not None:
+        attn_mask = ~attention_mask
+    if self.is_causal:
+        # causal mode only needs the 2048x2048 upper-triangular template mask
+        sparse_mode = 3
+        attn_mask = ops.ones((2048, 2048), mindspore.bool_).triu(diagonal=1)
+    attn_output = mindspore.ops.flash_attention_score(
+        query_states, key_states, value_states,
+        head_num=self.num_heads,
+        input_layout='BNSD',
+        attn_mask=attn_mask,
+        scalar_value=1 / math.sqrt(self.head_dim),
+        keep_prob=1 - self.attention_dropout,
+        pre_tokens=2147483647,
+        next_tokens=2147483647,
+        sparse_mode=sparse_mode,
+    )
+```
+
+---
+
+### 3. MoE Routing and Expert Computation Optimization
+
+#### 3.1 Qwen2-MoE: decode optimization
+
+```python
+if routing_weights.shape[0] == 1:
+    # single-token decode: visit only the activated top-k experts
+    final_hidden_states = ops.zeros((batch_size * sequence_length, hidden_dim), dtype=mindspore.float32)
+    flat_topk_idx = selected_experts.view(-1)
+    for i in range(self.top_k):
+        expert_idx = flat_topk_idx[i].item()
+        # keep the weight as a tensor (no .item()) to avoid precision loss
+        weight = routing_weights[0, i].to(mindspore.float32)
+        expert_layer = self.experts[expert_idx]
+        final_hidden_states += expert_layer(hidden_states).to(mindspore.float32).mul(weight)
+    final_hidden_states = final_hidden_states.to(hidden_states.dtype)
+```
+
+#### 3.2 DeepSeek-MoE: decode optimization
+
+```python
+@no_grad()
+def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights):
+    expert_cache = ops.zeros_like(x)
+    for i in range(self.num_experts_per_tok):
+        expert_id = flat_expert_indices[i].item()
+        weight = flat_expert_weights[i].item()
+        expert = self.experts[expert_id]
+        expert_out = expert(x)
+        expert_cache += expert_out * weight
+    return expert_cache
+```
+
+---
+
+## Final Results
+
+| model_name | memory_reserved (GB) | memory_allocated (GB) | avg_prefill_latency (s) | avg_decode_latency (s) |
+| :--- | :--- | :--- | :--- | :--- |
+| Qwen1.5-MoE-A2.7B-Chat | 31.138512896 | 29.234176512 | 1.8952324390411377 | 0.14382788760748297 |
+| deepseek-moe-16b-chat | 34.359738368 | 32.813018112 | 3.0526745319366455 | 0.18968531806339592 |
+
+---
+
+## Key Takeaways
+
+1. **Operator-level optimization**: replace slicing with mint/ops operators to make full use of Ascend NPU acceleration.
+2. **Attention optimization**: integrate FlashAttention to speed up prefill-stage inference.
+3. **MoE optimization**: specialize expert dispatch for the decode (single-token) scenario.
diff --git a/2025-Ascend-Innovation-Contest/S1/MoE/HUST_ASCEND/patches.zip b/2025-Ascend-Innovation-Contest/S1/MoE/HUST_ASCEND/patches.zip
new file mode 100644
index 00000000..65d236f0
Binary files a/2025-Ascend-Innovation-Contest/S1/MoE/HUST_ASCEND/patches.zip and b/2025-Ascend-Innovation-Contest/S1/MoE/HUST_ASCEND/patches.zip differ
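
As a quick sanity check of the README's `rotate_half` rewrite (replacing two slice reads with one `ops.split`), here is a NumPy sketch of the same transformation. NumPy stands in for MindSpore's `ops` here; the function names are illustrative only:

```python
import numpy as np

def rotate_half_slice(x):
    # original slice-based version: two half-width slice reads
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return np.concatenate((-x2, x1), axis=-1)

def rotate_half_split(x):
    # split-based version, mirroring ops.split(x, x.shape[-1] // 2, dim=-1)
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate((-x2, x1), axis=-1)

# both versions rotate the halves identically (last dim must be even)
x = np.arange(8.0).reshape(2, 4)
assert np.allclose(rotate_half_slice(x), rotate_half_split(x))
```

The transformation is purely mechanical, which is why the optimization can be applied without retraining: the operator change affects memory traffic on the NPU, not the computed values.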
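The decode-stage MoE rewrites (sections 3.1 and 3.2) share one idea: when only a single token is being decoded, loop over just the top-k activated experts and accumulate their weighted outputs, instead of the scatter/gather dispatch needed for full batches. A NumPy sketch of that logic, with toy linear "experts" and hand-picked routing (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, hidden = 4, 2, 8
# toy experts: each is a fixed linear map (stand-in for an expert FFN)
experts = [rng.standard_normal((hidden, hidden)) for _ in range(num_experts)]

x = rng.standard_normal((1, hidden))   # one decode token
selected = np.array([1, 3])            # top-k expert ids chosen by the router
weights = np.array([0.7, 0.3])         # normalized routing weights

# decode-stage loop: visit only the activated experts, accumulate in high precision
out = np.zeros((1, hidden))
for i in range(top_k):
    out += (x @ experts[selected[i]]) * weights[i]

# reference: dense mixture over ALL experts, zero weight on inactive ones
dense_w = np.zeros(num_experts)
dense_w[selected] = weights
ref = sum((x @ experts[e]) * dense_w[e] for e in range(num_experts))
assert np.allclose(out, ref)
```

The loop visits `top_k` experts instead of `num_experts`, and avoids building the index-scatter structures the batched path needs, which is where the decode-latency gain comes from.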
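For reference, the fused `flash_attention_score` call in section 2 computes standard scaled-dot-product attention; with `sparse_mode=3` the upper-triangular mask forbids attending to future positions. A single-head NumPy sketch of those reference semantics (not the fused kernel itself):

```python
import math
import numpy as np

def causal_attention(q, k, v):
    # softmax(q k^T / sqrt(d) + causal_mask) @ v for one head
    d = q.shape[-1]
    scores = q @ k.T / math.sqrt(d)
    # upper-triangular True entries mark forbidden (future) positions
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# position 0 can only attend to itself, so its output equals v[0]
q = np.eye(3)
v = np.arange(9.0).reshape(3, 3)
out = causal_attention(q, q, v)
assert np.allclose(out[0], v[0])
```

The fused operator produces the same result for the causal case while avoiding materializing the full attention matrix, which is the source of the prefill speedup.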