
[deps] feat: bump veomni pin to 58759e7 (FSDP2 pre-backward hook fix)#24

Merged
Luosuu merged 1 commit into main from deps/bump-veomni-fsdp2-hook-fix on May 6, 2026

Conversation

Luosuu (Collaborator) commented May 6, 2026

Summary

Bumps the `veomni` pin from `9c041db` to `58759e7`, which is the squash-merge of ByteDance-Seed/VeOmni#731. That PR restores the FSDP2 pre-backward unshard hook on `lm_head` for every MoE / VL architecture that uses VeOmni's chunk_logprobs path.

Why this matters

Without #731, `loss.backward()` on the actor update of any of the affected architectures crashes inside `chunk_logprobs.backward` with:

```
RuntimeError: setStorage: ... storage of size 0
```

Root cause: the patched forward did `output = MoeCausalLMOutputWithPast(...)` and only then set `output.log_probs = …` and `output.entropy = …`. Those post-construction mutations were invisible to `ModelOutput`'s pytree flattening, so FSDP2 never registered the pre-backward unshard hook on `lm_head`'s parent unit, and the weight reference saved into the chunk_logprobs custom autograd `Function` was still sharded when backward fired.

The merged squash commit (`58759e7`) contains both the original `qwen3_5_moe` fix (`#731`, from @di.lu) and the cross-arch extension (`ByteDance-Seed/VeOmni#732`) we stacked on top — combined diff covers `qwen3_moe`, `qwen3_5_moe`, `qwen3_vl_moe`, `qwen3_omni_moe`, `qwen2_vl`, `qwen2_5vl`, `qwen3_vl`, and `qwen3_5` (the 8 architectures that shared the post-construction-mutation anti-pattern).

Smoke verification

Both runs on 1×8B200 with the new pin (verl PR #6184 wiring `return_log_probs=True`):

| arch | model (recipe) | step 1 `rollout_probs_diff_max` | step 2 `rollout_probs_diff_max` | `actor/entropy` | actor update |
| --- | --- | --- | --- | --- | --- |
| `deepseek_v3` | Moonlight-16B-A3B-Instruct (GRPO + gsm8k) | 0.0 | 0.0 | ~0.034 | |
| `qwen3_moe` | Qwen3-30B-A3B (GRPO + gsm8k, `total_training_steps=2`) | 0.0 | 0.0 | ~0.124 | ✅ (8m47s end-to-end, gsm8k acc 0.454) |

The same Qwen3-30B-A3B recipe crashed earlier today on `9c041db` with the storage-of-size-0 error in `chunk_logprobs.backward`; it now runs to completion.

Why this is not a duplicate

```
gh pr list --repo verl-project/vexact --state open --search "veomni 58759e7 in:title,body"
gh pr list --repo verl-project/vexact --state open --search "fsdp2 in:title,body"
```

Both empty.

Test plan

  • `uv lock` resolves the new pin: `Updated veomni (dynamic) -> v0.1.9a4 (58759e78)`.
  • Bitwise rollout-actor alignment + actor update verified end-to-end on 1×8B200 for both `deepseek_v3` (Moonlight) and `qwen3_moe` (Qwen3-30B-A3B). Both show `rollout_probs_diff_max == 0.0` across two consecutive training steps.
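
The review thread's lockfile concern can be checked mechanically. A minimal sketch (the fragment below is a stand-in written to `/tmp`, and the `grep` patterns assume the `source = { git = … }` line format that `uv` emits in `uv.lock`):

```shell
# After running `uv lock`, the veomni entry in uv.lock should reference only
# the new revision. Simulated here on a stand-in lockfile fragment so the
# check itself can be exercised without uv installed.
cat > /tmp/uv_lock_fragment.toml <<'EOF'
name = "veomni"
version = "0.1.9a4"
source = { git = "https://github.com/ByteDance-Seed/VeOmni.git?rev=58759e78015ad429507079aa443215e3c515364f#58759e78015ad429507079aa443215e3c515364f" }
EOF
grep -q "58759e78" /tmp/uv_lock_fragment.toml && echo "new pin present"
grep -q "9c041db" /tmp/uv_lock_fragment.toml || echo "old pin gone"
```

Against the real repository the same two `grep` checks would run on `uv.lock` itself after `uv lock`.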

AI assistance

This PR was authored with AI assistance (Claude Code). The submitting human reviewed every line and ran the smoke tests above before requesting review.

Commit message (squash, `58759e7`):

Picks up ByteDance-Seed/VeOmni#731, which restores the FSDP2 pre-backward
unshard hook on ``lm_head`` for every MoE/VL architecture that uses VeOmni's
chunk_logprobs path. Without this fix, ``loss.backward()`` on the actor
update of qwen3_moe (e.g. Qwen3-30B-A3B), qwen3_5_moe, qwen3_vl_moe,
qwen3_omni_moe, qwen2_vl, qwen2_5vl, qwen3_vl, and qwen3_5 crashes inside
``chunk_logprobs.backward`` with ``setStorage … storage of size 0`` —
``output.log_probs = …`` mutations after constructing the model's
``MoeCausalLMOutputWithPast`` / ``CausalLMOutputWithPast`` dropped log_probs
and entropy out of ModelOutput's pytree-flattened path, so FSDP2 never
registered the pre-backward hook on lm_head's parent unit and the saved
weight reference stayed sharded into backward.

The merged #731 branch carried both the original qwen3_5_moe fix (from
@di.lu) and the cross-arch extension we landed in
ByteDance-Seed/VeOmni#732 — the squash commit (58759e7) now contains the
unified ``CausalLMOutputWithLogProbs`` / ``MoeCausalLMOutputWithLogProbs``
dataclasses plus 5 model-specific subclasses, with every patched
``*ForCausalLM.forward`` switched to constructor-time field assignment.

Smoke verified end-to-end on 1×8B200:
* Moonlight-16B-A3B-Instruct (deepseek_v3, GRPO + gsm8k):
  rollout_probs_diff_max = 0.0 across two consecutive steps,
  actor/entropy ≈ 0.034 (real chunk_logprobs entropy).
* Qwen3-30B-A3B (qwen3_moe, GRPO + gsm8k, total_training_steps=2):
  rollout_probs_diff_max = 0.0 on both steps, actor/entropy ≈ 0.124,
  actor update + final eval succeed (training cycle 8m47s, gsm8k acc 0.454).

gemini-code-assist (Bot) left a comment
Code Review

This pull request updates the veomni dependency to a newer revision in both pyproject.toml and uv.lock. Feedback indicates that the uv.lock file may have been manually edited or only partially updated, as the top-level revision hash was not changed; it is recommended to run uv lock to ensure the lockfile is fully synchronized.

Comment thread on `uv.lock`:

```diff
 name = "veomni"
 version = "0.1.9a4"
-source = { git = "https://github.com/ByteDance-Seed/VeOmni.git?rev=9c041dbdac469f372b1b9373078d77771c64638f#9c041dbdac469f372b1b9373078d77771c64638f" }
+source = { git = "https://github.com/ByteDance-Seed/VeOmni.git?rev=58759e78015ad429507079aa443215e3c515364f#58759e78015ad429507079aa443215e3c515364f" }
```
Severity: medium

The uv.lock file revision (at the top of the file) does not appear to have been updated in this PR, despite the changes to the veomni pin in pyproject.toml. This suggests the lockfile might have been manually edited or only partially staged. It is recommended to run uv lock and commit the entire resulting uv.lock file to ensure that the revision hash and all dependency metadata are correctly synchronized.

@Luosuu Luosuu merged commit 51ea0ab into main May 6, 2026
2 checks passed
@Luosuu Luosuu deleted the deps/bump-veomni-fsdp2-hook-fix branch May 7, 2026 07:25