[deps] feat: bump veomni pin to 58759e7 (FSDP2 pre-backward hook fix)#24
Conversation
Picks up ByteDance-Seed/VeOmni#731, which restores the FSDP2 pre-backward unshard hook on `lm_head` for every MoE/VL architecture that uses VeOmni's `chunk_logprobs` path. Without this fix, `loss.backward()` on the actor update of qwen3_moe (e.g. Qwen3-30B-A3B), qwen3_5_moe, qwen3_vl_moe, qwen3_omni_moe, qwen2_vl, qwen2_5vl, qwen3_vl, and qwen3_5 crashes inside `chunk_logprobs.backward` with `setStorage … storage of size 0`. The cause: `output.log_probs = …` mutations applied after constructing the model's `MoeCausalLMOutputWithPast` / `CausalLMOutputWithPast` dropped `log_probs` and `entropy` out of ModelOutput's pytree-flattened path, so FSDP2 never registered the pre-backward hook on `lm_head`'s parent unit and the saved weight reference stayed sharded into backward.

The merged #731 branch carried both the original qwen3_5_moe fix (from @di.lu) and the cross-arch extension we landed in ByteDance-Seed/VeOmni#732. The squash commit (58759e7) now contains the unified `CausalLMOutputWithLogProbs` / `MoeCausalLMOutputWithLogProbs` dataclasses plus 5 model-specific subclasses, with every patched `*ForCausalLM.forward` switched to constructor-time field assignment.

Smoke verified end-to-end on 1×8B200:

* Moonlight-16B-A3B-Instruct (deepseek_v3, GRPO + gsm8k): rollout_probs_diff_max = 0.0 across two consecutive steps, actor/entropy ≈ 0.034 (real `chunk_logprobs` entropy).
* Qwen3-30B-A3B (qwen3_moe, GRPO + gsm8k, total_training_steps=2): rollout_probs_diff_max = 0.0 on both steps, actor/entropy ≈ 0.124, actor update + final eval succeed (training cycle 8m47s, gsm8k acc 0.454).
Code Review
This pull request updates the `veomni` dependency to a newer revision in both `pyproject.toml` and `uv.lock`. Feedback indicates that the `uv.lock` file may have been manually edited or only partially updated, since the top-level revision hash was not changed; it is recommended to run `uv lock` to ensure the lockfile is fully synchronized.
| name = "veomni" | ||
| version = "0.1.9a4" | ||
| source = { git = "https://github.com/ByteDance-Seed/VeOmni.git?rev=9c041dbdac469f372b1b9373078d77771c64638f#9c041dbdac469f372b1b9373078d77771c64638f" } | ||
| source = { git = "https://github.com/ByteDance-Seed/VeOmni.git?rev=58759e78015ad429507079aa443215e3c515364f#58759e78015ad429507079aa443215e3c515364f" } |
The `uv.lock` file revision (at the top of the file) does not appear to have been updated in this PR, despite the changes to the `veomni` pin in `pyproject.toml`. This suggests the lockfile might have been manually edited or only partially staged. It is recommended to run `uv lock` and commit the entire resulting `uv.lock` file to ensure that the revision hash and all dependency metadata are correctly synchronized.
Summary
Bumps the `veomni` pin from `9c041db` to `58759e7`, which is the squash-merge of ByteDance-Seed/VeOmni#731. That PR restores the FSDP2 pre-backward unshard hook on `lm_head` for every MoE / VL architecture that uses VeOmni's `chunk_logprobs` path.
Why this matters
Without #731, `loss.backward()` on the actor update of any of the affected architectures crashes inside `chunk_logprobs.backward` with:
```
RuntimeError: setStorage: ... storage of size 0
```
Root cause: the patched forward did `output = MoeCausalLMOutputWithPast(...)` followed by `output.log_probs = …; output.entropy = …` after construction. Those mutations were invisible to `ModelOutput`'s pytree flattening, so FSDP2 never registered the pre-backward hook on `lm_head`'s parent unit, and the saved weight reference held by the `chunk_logprobs` custom autograd `Function` was still sharded by the time backward fired.
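To make the failure mode concrete, here is a minimal sketch of the anti-pattern, assuming stock `transformers` `ModelOutput` semantics (the tensors are dummies, not real model outputs):

```
import torch
from transformers.modeling_outputs import CausalLMOutputWithPast

# Anti-pattern (pre-#731): `log_probs` is not a declared field on the base
# dataclass, so a post-construction attribute write lands only on the Python
# object, never in the ordered-dict storage that pytree flattening iterates.
output = CausalLMOutputWithPast(logits=torch.randn(1, 4, 8))
output.log_probs = torch.randn(1, 4)

print("log_probs" in output.keys())  # False: invisible to flattening
print(len(output.to_tuple()))        # 1: only `logits` survives
```

Because FSDP2 discovers tensors in the module output via that flattened path, anything attached this way is simply never seen.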
The merged squash commit (`58759e7`) contains both the original `qwen3_5_moe` fix (`#731`, from @di.lu) and the cross-arch extension (ByteDance-Seed/VeOmni#732) we stacked on top. The combined diff covers `qwen3_moe`, `qwen3_5_moe`, `qwen3_vl_moe`, `qwen3_omni_moe`, `qwen2_vl`, `qwen2_5vl`, `qwen3_vl`, and `qwen3_5`, the 8 architectures that shared the post-construction-mutation anti-pattern.
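A minimal sketch of the fixed shape (the `CausalLMOutputWithLogProbs` name comes from the PR description, but the exact field definitions here are an assumption, not the verbatim VeOmni code):

```
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.modeling_outputs import CausalLMOutputWithPast

@dataclass
class CausalLMOutputWithLogProbs(CausalLMOutputWithPast):
    # Declared fields are registered in ModelOutput's dict storage at
    # construction time, so they stay on the pytree-flattened path that
    # FSDP2 walks when installing its pre-backward unshard hook.
    log_probs: Optional[torch.Tensor] = None
    entropy: Optional[torch.Tensor] = None

output = CausalLMOutputWithLogProbs(
    logits=torch.randn(1, 4, 8),
    log_probs=torch.randn(1, 4),   # constructor-time assignment, not mutation
    entropy=torch.randn(1, 4),
)
print("log_probs" in output.keys())  # True
```

Every patched `*ForCausalLM.forward` accordingly passes `log_probs=` / `entropy=` into the constructor instead of mutating the returned object.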
Smoke verification
Both runs on 1×8B200 with the new pin (verl PR #6184 wiring `return_log_probs=True`):

* Moonlight-16B-A3B-Instruct (deepseek_v3, GRPO + gsm8k): rollout_probs_diff_max = 0.0 across two consecutive steps, actor/entropy ≈ 0.034.
* Qwen3-30B-A3B (qwen3_moe, GRPO + gsm8k, total_training_steps=2): rollout_probs_diff_max = 0.0 on both steps, actor/entropy ≈ 0.124, actor update + final eval succeed (training cycle 8m47s, gsm8k acc 0.454).

The same Qwen3-30B-A3B recipe crashed earlier today on `9c041db` with the storage-of-size-0 error in `chunk_logprobs.backward`; it runs to completion now.
Why not duplicate
```
gh pr list --repo verl-project/vexact --state open --search "veomni 58759e7 in:title,body"
gh pr list --repo verl-project/vexact --state open --search "fsdp2 in:title,body"
```
Both empty.
Test plan
Covered by the smoke runs above: both GRPO + gsm8k recipes (Moonlight-16B-A3B-Instruct and Qwen3-30B-A3B) complete rollout, actor update, and final eval on the new pin on 1×8B200.
AI assistance
This PR was authored with AI assistance (Claude Code). The submitting human reviewed every line and ran the smoke tests above before requesting review.