[deps] feat: bump veomni pin to 58759e7 (FSDP2 pre-backward hook fix)#24
Conversation
Picks up ByteDance-Seed/VeOmni#731, which restores the FSDP2 pre-backward unshard hook on `lm_head` for every MoE/VL architecture that uses VeOmni's `chunk_logprobs` path. Without this fix, `loss.backward()` on the actor update of qwen3_moe (e.g. Qwen3-30B-A3B), qwen3_5_moe, qwen3_vl_moe, qwen3_omni_moe, qwen2_vl, qwen2_5vl, qwen3_vl, and qwen3_5 crashes inside `chunk_logprobs.backward` with `setStorage … storage of size 0`. The cause: `output.log_probs = …` mutations applied after constructing the model's `MoeCausalLMOutputWithPast` / `CausalLMOutputWithPast` dropped `log_probs` and `entropy` out of ModelOutput's pytree-flattened path, so FSDP2 never registered the pre-backward hook on `lm_head`'s parent unit and the saved weight reference stayed sharded into backward.

The merged #731 branch carried both the original qwen3_5_moe fix (from @di.lu) and the cross-arch extension we landed in ByteDance-Seed/VeOmni#732. The squash commit (58759e7) now contains the unified `CausalLMOutputWithLogProbs` / `MoeCausalLMOutputWithLogProbs` dataclasses plus 5 model-specific subclasses, with every patched `*ForCausalLM.forward` switched to constructor-time field assignment.

Smoke verified end-to-end on 1×8B200:

* Moonlight-16B-A3B-Instruct (deepseek_v3, GRPO + gsm8k): rollout_probs_diff_max = 0.0 across two consecutive steps, actor/entropy ≈ 0.034 (real `chunk_logprobs` entropy).
* Qwen3-30B-A3B (qwen3_moe, GRPO + gsm8k, total_training_steps=2): rollout_probs_diff_max = 0.0 on both steps, actor/entropy ≈ 0.124, actor update + final eval succeed (training cycle 8m47s, gsm8k acc 0.454).
Code Review
This pull request updates the `veomni` dependency to a newer revision in both `pyproject.toml` and `uv.lock`. Feedback indicates that the `uv.lock` file may have been manually edited or only partially updated, since the top-level revision hash was not changed; it is recommended to run `uv lock` to ensure the lockfile is fully synchronized.
| name = "veomni" | ||
| version = "0.1.9a4" | ||
| source = { git = "https://github.com/ByteDance-Seed/VeOmni.git?rev=9c041dbdac469f372b1b9373078d77771c64638f#9c041dbdac469f372b1b9373078d77771c64638f" } | ||
| source = { git = "https://github.com/ByteDance-Seed/VeOmni.git?rev=58759e78015ad429507079aa443215e3c515364f#58759e78015ad429507079aa443215e3c515364f" } |
The `uv.lock` file revision (at the top of the file) does not appear to have been updated in this PR, despite the changes to the `veomni` pin in `pyproject.toml`. This suggests the lockfile might have been manually edited or only partially staged. It is recommended to run `uv lock` and commit the entire resulting `uv.lock` file to ensure that the revision hash and all dependency metadata are correctly synchronized.
Summary
Bumps the `veomni` pin from `9c041db` to `58759e7`, which is the squash-merge of ByteDance-Seed/VeOmni#731. That PR restores the FSDP2 pre-backward unshard hook on `lm_head` for every MoE / VL architecture that uses VeOmni's `chunk_logprobs` path.
Why this matters
Without #731, `loss.backward()` on the actor update of any of the affected architectures crashes inside `chunk_logprobs.backward` with:
```
RuntimeError: setStorage: ... storage of size 0
```
Root cause: the patched forward did `output = MoeCausalLMOutputWithPast(...)` followed by `output.log_probs = …; output.entropy = …` after construction. Those mutations were invisible to `ModelOutput`'s pytree flattening, so FSDP2 never registered the pre-backward hook on `lm_head`'s parent unit, and the saved weight reference held by the `chunk_logprobs` custom autograd `Function` was still sharded by the time backward fired.
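To make the failure mode concrete, here is a minimal sketch of the anti-pattern, assuming stock `transformers` `ModelOutput` semantics (the tensors are dummies, not real model outputs):

```
import torch
from transformers.modeling_outputs import CausalLMOutputWithPast

# Anti-pattern (pre-#731): `log_probs` is not a declared field on the base
# dataclass, so a post-construction attribute write lands only on the Python
# object, never in the ordered-dict storage that pytree flattening iterates.
output = CausalLMOutputWithPast(logits=torch.randn(1, 4, 8))
output.log_probs = torch.randn(1, 4)

print("log_probs" in output.keys())  # False: invisible to flattening
print(len(output.to_tuple()))        # 1: only `logits` survives
```

Because FSDP2 discovers tensors in the module output via that flattened path, anything attached this way is simply never seen.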
The merged squash commit (`58759e7`) contains both the original `qwen3_5_moe` fix (`#731`, from @di.lu) and the cross-arch extension (ByteDance-Seed/VeOmni#732) we stacked on top. The combined diff covers `qwen3_moe`, `qwen3_5_moe`, `qwen3_vl_moe`, `qwen3_omni_moe`, `qwen2_vl`, `qwen2_5vl`, `qwen3_vl`, and `qwen3_5`, the 8 architectures that shared the post-construction-mutation anti-pattern.
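A minimal sketch of the fixed shape (the `CausalLMOutputWithLogProbs` name comes from the PR description, but the exact field definitions here are an assumption, not the verbatim VeOmni code):

```
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.modeling_outputs import CausalLMOutputWithPast

@dataclass
class CausalLMOutputWithLogProbs(CausalLMOutputWithPast):
    # Declared fields are registered in ModelOutput's dict storage at
    # construction time, so they stay on the pytree-flattened path that
    # FSDP2 walks when installing its pre-backward unshard hook.
    log_probs: Optional[torch.Tensor] = None
    entropy: Optional[torch.Tensor] = None

output = CausalLMOutputWithLogProbs(
    logits=torch.randn(1, 4, 8),
    log_probs=torch.randn(1, 4),   # constructor-time assignment, not mutation
    entropy=torch.randn(1, 4),
)
print("log_probs" in output.keys())  # True
```

Every patched `*ForCausalLM.forward` accordingly passes `log_probs=` / `entropy=` into the constructor instead of mutating the returned object.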
Smoke verification
Both runs on 1×8B200 with the new pin (verl PR #6184 wiring `return_log_probs=True`):

* Moonlight-16B-A3B-Instruct (deepseek_v3, GRPO + gsm8k): rollout_probs_diff_max = 0.0 across two consecutive steps, actor/entropy ≈ 0.034.
* Qwen3-30B-A3B (qwen3_moe, GRPO + gsm8k, total_training_steps=2): rollout_probs_diff_max = 0.0 on both steps, actor/entropy ≈ 0.124, actor update + final eval succeed (training cycle 8m47s, gsm8k acc 0.454).

The same Qwen3-30B-A3B recipe crashed earlier today on `9c041db` with the storage-of-size-0 error in `chunk_logprobs.backward`; it runs to completion now.
Why not duplicate
```
gh pr list --repo verl-project/vexact --state open --search "veomni 58759e7 in:title,body"
gh pr list --repo verl-project/vexact --state open --search "fsdp2 in:title,body"
```
Both empty.
Test plan
Covered by the smoke runs above: both GRPO + gsm8k recipes (Moonlight-16B-A3B-Instruct and Qwen3-30B-A3B) complete rollout, actor update, and final eval on the new pin on 1×8B200.
AI assistance
This PR was authored with AI assistance (Claude Code). The submitting human reviewed every line and ran the smoke tests above before requesting review.