[doc] feat: Add Qwen3vl-8B NPU Optimization Practice #5873

Draft
Rhetee wants to merge 3 commits into verl-project:main from Rhetee:main

Conversation

@Rhetee (Contributor) commented Apr 3, 2026

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

This PR updates the Qwen3vl-8B NPU Optimization Practice documentation; developers can refer to this doc for guidance.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive tutorial for optimizing Qwen3vl-8B GRPO training and inference on Ascend NPU platforms, covering performance profiling, operator fusion, and scheduling optimizations. The review feedback identifies several technical inaccuracies in the documentation's code and configuration snippets, including mismatched function names that would cause runtime errors, invalid YAML syntax for dynamic batch size calculations, and incomplete Python function examples lacking necessary variable definitions.

Comment on lines +117 to +120
modeling_qwen3_vl_moe.Qwen3VLMoeTextRMSNorm.forward = rms_norm_forward
modeling_qwen3_vl_moe.apply_rotary_pos_emb = apply_rotary_pos_emb_qwen3_npu
modeling_qwen3_vl.Qwen3VLTextRMSNorm.forward = rms_norm_forward
modeling_qwen3_vl.Qwen3VLTextMLP.forward = silu_forward
Severity: high

The function names referenced in the documentation's code snippet are inconsistent with the actual definitions in verl/models/transformers/npu_patch.py. For example, the documentation uses rms_norm_forward and apply_rotary_pos_emb_qwen3_npu, while the actual code defines rms_norm_forward_npu and apply_rotary_pos_emb_npu. This will cause a NameError when users manually reference or inject this logic.

Suggested change
modeling_qwen3_vl_moe.Qwen3VLMoeTextRMSNorm.forward = rms_norm_forward
modeling_qwen3_vl_moe.apply_rotary_pos_emb = apply_rotary_pos_emb_qwen3_npu
modeling_qwen3_vl.Qwen3VLTextRMSNorm.forward = rms_norm_forward
modeling_qwen3_vl.Qwen3VLTextMLP.forward = silu_forward
modeling_qwen3_vl_moe.Qwen3VLMoeTextRMSNorm.forward = rms_norm_forward_npu
modeling_qwen3_vl_moe.apply_rotary_pos_emb = apply_rotary_pos_emb_npu
modeling_qwen3_vl.Qwen3VLTextRMSNorm.forward = rms_norm_forward_npu
modeling_qwen3_vl.Qwen3VLTextMLP.forward = silu_forward_npu
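To illustrate the pattern the snippet relies on, here is a minimal, self-contained sketch of module-level monkey-patching. The class and function names below (TextRMSNorm, rms_norm_forward_npu) are illustrative stand-ins, not the real verl or transformers definitions; the point is only that the replacement must match a name that actually exists.

```python
# Minimal sketch of the monkey-patching pattern: replace a method on a class
# with an optimized implementation at import time. Names are stand-ins.

class TextRMSNorm:
    def forward(self, x):
        # reference (unpatched) implementation
        return [v * 1.0 for v in x]

def rms_norm_forward_npu(self, x):
    # stand-in for a fused NPU kernel; must keep the original signature
    return [v * 2.0 for v in x]

# Patch: every existing and future instance now dispatches to the new version.
TextRMSNorm.forward = rms_norm_forward_npu

norm = TextRMSNorm()
print(norm.forward([1.0, 2.0]))  # [2.0, 4.0]
```

Because the patch is assignment by name, a typo in the replacement function's name fails at import time with a NameError, which is exactly the failure mode the review points out.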

Comment on lines +148 to +150
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 2 * (max_prompt_len + max_response_len)
log_prob_max_token_len_per_gpu: 4 * (max_prompt_len + max_response_len)

Severity: high

YAML configuration files do not support arithmetic expressions directly in values (such as 2 * (...)) unless a specific parser is used (e.g., a custom OmegaConf resolver). In addition, options such as use_dynamic_bsz are not defined in the current verl/trainer/config/ppo_trainer.yaml. If these are new features, please make sure the config template is updated accordingly; otherwise, use concrete numeric values in the documentation to avoid misleading users.
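One way to follow this advice is to pre-compute the products and document the formula in a comment. The snippet below is a sketch assuming illustrative values max_prompt_len=1024 and max_response_len=3072; these are examples, not recommendations.

```yaml
# Pre-computed values, assuming max_prompt_len=1024 and max_response_len=3072
# (illustrative only; pick values for your own workload):
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 8192        # 2 * (1024 + 3072)
log_prob_max_token_len_per_gpu: 16384  # 4 * (1024 + 3072)
```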

Comment on lines +194 to +214
def forward_native():
if is_first_layer:
cos_sin = self.cos_sin_cache[positions]
cos, sin = cos_sin.chunk(2, dim=-1)
if self.mrope_interleaved:
cos = apply_interleaved_rope(cos, self.mrope_section)
sin = apply_interleaved_rope(sin, self.mrope_section)
cos = cos.repeat(1, 2)
sin = sin.repeat(1, 2)
self.cos = cos.unsqueeze(0).unsqueeze(-2).contiguous()
self.sin = sin.unsqueeze(0).unsqueeze(-2).contiguous()
forward_context.is_first_layer = False

query_shape = query.shape
query = query.view(num_tokens, -1, self.head_size)
query_rot = query[..., :self.rotary_dim]
query_pass = query[..., self.rotary_dim:]
query_rot = query_rot.unsqueeze(0)
query_rot = torch_npu.npu_rotary_mul(query_rot, self.cos, self.sin, "half").squeeze(0)
query = torch.cat((query_rot, query_pass), dim=-1).reshape(query_shape)

Severity: high

The forward_native code snippet is not rigorous enough as a technical reference. The variables positions, query, num_tokens, and is_first_layer are neither defined in the function scope nor passed in as parameters. Consider adding a complete function signature or the necessary context-initialization logic so that the example code is correct and usable as a reference.
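One way to address this is to make the missing names explicit parameters. The pseudocode-level sketch below shows only the shape of such a fix; the signature and the forward_context attribute are assumptions for illustration, not the actual API, and the body (which needs torch_npu and Ascend hardware) is elided.

```python
# Pseudocode sketch: the previously undefined names become explicit inputs.
# Not runnable as-is; requires torch / torch_npu on Ascend hardware.
def forward_native(self, positions, query, forward_context):
    num_tokens = query.shape[0]
    if forward_context.is_first_layer:
        cos_sin = self.cos_sin_cache[positions]
        ...  # build self.cos / self.sin as in the snippet above
        forward_context.is_first_layer = False
    ...  # apply rotary embedding to query as in the snippet above
    return query
```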
