
[recipe, diffusion] fix: let Ray set Ascend visible devices in Qwen-Image NPU script#227

Open
Sky-Trigger wants to merge 2 commits into verl-project:main from Sky-Trigger:fix-examples-flowgrpo-npu-error

Conversation

@Sky-Trigger

Contributor

What does this PR do?

This PR removes RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 from the Qwen-Image OCR LoRA NPU FlowGRPO example script:

examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_npu.sh

With this environment variable set, Ray skips rewriting ASCEND_RT_VISIBLE_DEVICES for Ascend/NPU workers. In this FlowGRPO Qwen-Image NPU recipe, that can lead to a device placement mismatch across workers.

Observed error:

AssertionError: Expects tensor to be on the compute device npu:15, was on npu:0

Removing this override allows Ray to manage the per-worker Ascend device visibility normally, which fixes the device mismatch in the NPU FlowGRPO training script.

Why is this needed?

RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 is useful in some vLLM-Ascend distributed serving setups where device mapping is handled outside Ray. However, for this verl-omni FlowGRPO training recipe, keeping it enabled prevents Ray from setting the expected per-worker device visibility and may cause tensors to be created on the wrong NPU.

This change keeps the example script aligned with Ray-managed NPU placement and avoids the runtime device assertion failure.
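
For environments where the override may be inherited from a parent shell or a shared profile rather than from the script itself, a pre-launch guard along the following lines could help. This is a minimal sketch and not part of this PR: the function name `ensure_ray_manages_ascend_devices` is hypothetical, and it assumes that any non-empty value of the variable is enough to disable Ray's device assignment.

```shell
# Hypothetical pre-launch guard (not part of the PR).
# Assumption: any non-empty value of the override stops Ray from
# setting ASCEND_RT_VISIBLE_DEVICES for its workers.
ensure_ray_manages_ascend_devices() {
    if [ -n "${RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES:-}" ]; then
        echo "warning: unsetting RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES" \
             "so Ray can set ASCEND_RT_VISIBLE_DEVICES per worker" >&2
        unset RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
    fi
}

# Call once before the training entry point is launched.
ensure_ray_manages_ascend_devices
```

Placing the call near the top of the launch script keeps the recipe on Ray-managed placement even when the surrounding environment was prepared for a vLLM-Ascend serving setup.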

Test

Tested by running the Qwen-Image OCR LoRA NPU FlowGRPO example.

Before this change, the script failed with:

AssertionError: Expects tensor to be on the compute device npu:15, was on npu:0

After removing RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1, the script can proceed without this device mismatch error.
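
To keep the override from being reintroduced later, a small check like the one below could be run against the example script. It is a sketch under stated assumptions: the helper name `script_sets_noset_override` is hypothetical, and it only looks for an assignment of the variable on non-comment lines.

```shell
# Hypothetical regression check (not part of the PR).
# Returns 0 if the given script assigns the override on a non-comment line,
# and non-zero otherwise.
script_sets_noset_override() {
    grep -v '^[[:space:]]*#' "$1" 2>/dev/null \
        | grep -q 'RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES='
}

# Example usage from the repository root:
#   if script_sets_noset_override examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_npu.sh; then
#       echo "override is still set in the script" >&2
#   fi
```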

Remove export of RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES variable.

Signed-off-by: Trigger <129651635+Sky-Trigger@users.noreply.github.com>
@Sky-Trigger Sky-Trigger requested a review from SamitHuang as a code owner July 2, 2026 08:25

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request removes the environment variable export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 from the run_qwen_image_ocr_lora_npu.sh script. There are no review comments to evaluate, and I have no additional feedback to provide on this change.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@SamitHuang SamitHuang requested a review from panshaowu July 2, 2026 09:04