# MindNLP Model Optimization Details (DeepSeek-MoE & Qwen2-MoE)
## Evaluation Results

| Metric | Average Score |
|---------|---------|
| Peak memory score | 100 |
| Prefill latency score | 109.3774 |
| Decode latency score | 360.0786 |
| **Total** | **189.8187** |

## Optimized Models

This project adapts and performance-tunes the following two MoE (Mixture of Experts) models for Ascend NPU:

1. **DeepSeek-MoE-16B-Chat** - an open-source MoE LLM from DeepSeek
2. **Qwen1.5-MoE-A2.7B-Chat** - an open-source MoE LLM from Qwen (Tongyi Qianwen)

---

## Core Optimization Techniques

### 1. MindSpore Operator Adaptation

#### 1.1 Replacing slicing with ops.split


```python
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    # Before: two separate slice ops
    # x1 = x[..., : x.shape[-1] // 2]
    # x2 = x[..., x.shape[-1] // 2 :]

    # After: a single fused split along the last dim
    x1, x2 = ops.split(x, x.shape[-1] // 2, dim=-1)
    return ops.cat((-x2, x1), dim=-1)
```
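The equivalence can be checked with a plain-Python sketch (lists stand in for tensors; no MindSpore involved): splitting the last dim in half yields exactly the two pieces the commented-out slices produce.

```python
# Pure-Python sketch of the equivalence: split-in-half == the two slices,
# so rotate_half's result is unchanged by the rewrite.
def rotate_half_ref(vec):
    mid = len(vec) // 2
    x1, x2 = vec[:mid], vec[mid:]   # what ops.split(x, mid, dim=-1) returns
    return [-v for v in x2] + x1    # concat(-x2, x1)

print(rotate_half_ref([1, 2, 3, 4]))  # [-3, -4, 1, 2]
```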


#### 1.2 Replacing slicing with ops.narrow

```python
def forward(self, x, seq_len=None):
    # x: [bs, num_attention_heads, seq_len, head_size]
    if seq_len > self.max_seq_len_cached:
        self._set_cos_sin_cache(seq_len=seq_len, dtype=x.dtype)

    return (
        # Before: slicing the caches
        # self.cos_cached[:seq_len].to(dtype=x.dtype),
        # self.sin_cached[:seq_len].to(dtype=x.dtype),
        ops.narrow(self.cos_cached, 0, 0, seq_len).to(dtype=x.dtype),
        ops.narrow(self.sin_cached, 0, 0, seq_len).to(dtype=x.dtype),
    )
```

**Benefit**: `ops.narrow` avoids the extra memory copy that the slicing operation incurs.
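For reference, `narrow(input, dim, start, length)` selects `length` elements starting at `start` along `dim`. A list-based sketch of the semantics (the MindSpore kernel avoids the copy; this sketch only shows what is selected):

```python
def narrow_1d(seq, start, length):
    """List-based sketch of ops.narrow(t, 0, start, length): take
    `length` elements beginning at `start` along the first dim."""
    return seq[start:start + length]

cos_cached = [1.0, 0.9, 0.8, 0.7, 0.6]
print(narrow_1d(cos_cached, 0, 3))  # [1.0, 0.9, 0.8]
```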

---

### 2. FlashAttention Optimization

```python
else:  # prefill path: use MindSpore's fused flash_attention_score operator
    sparse_mode = 0
    if attention_mask is not None:
        attention_mask = ~attention_mask

    if self.is_causal:
        sparse_mode = 3
        # fixed 2048x2048 upper-triangular causal mask, as sparse_mode=3 expects
        attention_mask = ops.ones((2048, 2048), mindspore.bool_).triu(diagonal=1)
    attn_output = mindspore.ops.flash_attention_score(
        query_states, key_states, value_states,
        head_num=self.num_heads,
        input_layout='BNSD',
        attn_mask=attention_mask,
        scalar_value=1 / math.sqrt(self.head_dim),
        keep_prob=1 - self.attention_dropout,
        sparse_mode=sparse_mode)
```
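The causal mask passed alongside `sparse_mode=3` is an upper-triangular boolean matrix in which `True` marks the future key positions a query must not attend to. A small pure-Python sketch of what `ones(n, n).triu(diagonal=1)` produces:

```python
def causal_mask(n):
    """Boolean n x n mask equivalent to ones(n, n).triu(diagonal=1):
    entry [q][k] is True when key k lies strictly after query q."""
    return [[k > q for k in range(n)] for q in range(n)]

for row in causal_mask(4):
    print([int(v) for v in row])
# [0, 1, 1, 1]
# [0, 0, 1, 1]
# [0, 0, 0, 1]
# [0, 0, 0, 0]
```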



---

### 3. MoE Routing and Expert Computation Optimization

#### 3.1 Qwen2-MoE: decode optimization

```python
if routing_weights.shape[0] == 1:
    # single token: loop over the activated top-k experts directly
    final_hidden_states = ops.zeros((batch_size * sequence_length, hidden_dim), dtype=mindspore.float32)
    flat_topk_idx = selected_experts.view(-1)
    for i in range(self.top_k):
        expert_idx = flat_topk_idx[i].item()
        # keep the weight as a float32 tensor (no .item()) to avoid precision loss
        weight = routing_weights[0, i].to(mindspore.float32)
        expert_layer = self.experts[expert_idx]
        final_hidden_states += expert_layer(hidden_states).to(mindspore.float32).mul(weight)
    final_hidden_states = final_hidden_states.to(hidden_states.dtype)
```
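The decode path above reduces to a weighted sum over the top-k experts for a single token. A pure-Python sketch of that mixing step (the two lambda "experts" and the routing weights below are made-up stand-ins, not the model's values):

```python
# Accumulate each selected expert's output scaled by its routing weight,
# mirroring the single-token branch above.
def mix_topk(x, experts, selected, weights):
    out = [0.0] * len(x)
    for idx, w in zip(selected, weights):
        out = [acc + w * v for acc, v in zip(out, experts[idx](x))]
    return out

experts = {0: lambda v: [2 * e for e in v],   # dummy expert 0
           1: lambda v: [e + 1 for e in v]}   # dummy expert 1
print(mix_topk([1.0, 2.0], experts, [0, 1], [0.75, 0.25]))  # [2.0, 3.75]
```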

#### 3.2 DeepSeek-MoE: decode optimization

**Decode stage**

```python
@no_grad()
def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights):
    # decode handles one token, so simply accumulate each selected
    # expert's weighted output instead of a full scatter/gather dispatch
    expert_cache = ops.zeros_like(x)
    for i in range(self.num_experts_per_tok):
        expert_id = flat_expert_indices[i].item()
        weight = flat_expert_weights[i].item()
        expert = self.experts[expert_id]
        expert_out = expert(x)
        expert_cache += expert_out * weight
    return expert_cache
```
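A quick pure-Python check (dummy lambda experts and made-up weights) that this per-slot loop produces the same result as a grouped "dispatch by expert" formulation when there is only one token:

```python
def decode_loop(x, indices, weights, experts):
    # per-slot accumulation, mirroring moe_infer_decode above
    acc = [0.0] * len(x)
    for idx, w in zip(indices, weights):
        acc = [a + w * v for a, v in zip(acc, experts[idx](x))]
    return acc

def dispatch_by_expert(x, indices, weights, experts):
    # group weights per expert first, then apply each expert once
    out = [0.0] * len(x)
    for eid in sorted(set(indices)):
        w_sum = sum(w for i, w in zip(indices, weights) if i == eid)
        out = [o + w_sum * v for o, v in zip(out, experts[eid](x))]
    return out

experts = {0: lambda v: [e * e for e in v], 2: lambda v: [e - 1 for e in v]}
a = decode_loop([3.0, 4.0], [0, 2], [0.5, 0.5], experts)
b = dispatch_by_expert([3.0, 4.0], [0, 2], [0.5, 0.5], experts)
print(a, a == b)  # [5.5, 9.5] True
```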



---


## Final Results
| model_name | memory_reserved (GB) | memory_allocated (GB) | avg_prefill_latency (s) | avg_decode_latency (s) |
| :--- | :--- | :--- | :--- | :--- |
| Qwen1.5-MoE-A2.7B-Chat | 31.138512896 | 29.234176512 | 1.8952324390411377 | 0.14382788760748297 |
| deepseek-moe-16b-chat | 34.359738368 | 32.813018112 | 3.0526745319366455 | 0.18968531806339592 |



---

## Key Takeaways

1. **Operator-level optimization**: replace slicing with fused MindSpore operators (`ops.split`, `ops.narrow`) to fully leverage Ascend NPU acceleration.
2. **Attention optimization**: integrate FlashAttention to speed up prefill-stage inference.
3. **MoE optimization**: specialize the routing/expert loop for the single-token decode path.


