
verl-project/vexact


VeXact

Transformer-based bitwise-aligned rollout for VeOmni FSDP with VeRL integration.

VeXact is our zero-mismatch rollout engine for LLM reinforcement learning. See our paper Diagnosing Training-Inference Mismatch in LLM Reinforcement Learning for its use as a TIM-free diagnostic baseline.

Key Features

  • 🎯 Bitwise-aligned training & inference — the VeOmni FSDP actor and the VeXact rollout engine produce identical logprobs for dense and MoE models with verl (the legacy FSDP engine is not supported for MoE models).
    • Dense models should work out of the box unless they use ops that differ between training and inference, such as linear attention.
    • MoE models need to be patched with the fused MoE kernel, as in our Qwen3-MoE and DeepSeek-V3 examples.
  • ⚡ Fast, aligned kernels — fused MoE, fused linear cross-entropy, and FlashAttention 3/4 with paged KV cache, all numerically consistent between training and inference
  • 🧩 Simple model definitions — transformer model code is self-contained and easy to audit, so training and inference model definitions stay in sync
  • 📖 Readable codebase — clean implementation with chunked prefill, pipeline parallelism, and CUDA graph support
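The "bitwise-aligned" claim is checkable with no tolerance at all: recompute per-token logprobs with the trainer and compare them exactly against what the rollout engine reported. A minimal sketch of such a check (the function and its inputs are illustrative, not a VeXact API):

```python
def bitwise_aligned(actor_logprobs, rollout_logprobs):
    """True iff every per-token logprob matches exactly.

    No epsilon, no allclose: bitwise alignment means float equality,
    so even a 1-ulp divergence is a mismatch.
    """
    if len(actor_logprobs) != len(rollout_logprobs):
        return False
    return all(a == b for a, b in zip(actor_logprobs, rollout_logprobs))
```

This is deliberately stricter than the usual `torch.allclose`-style comparison used to validate rollout engines: a tolerance hides exactly the small per-token biases that accumulate into training-inference mismatch.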

Effectiveness

Qwen3-30B-A3B · REINFORCE · DAPO dataset

Off-policy logprob bias from vLLM causes the rollout-correction KL to explode after ~300 steps, which triggers gradient norm blow-up and ultimately training collapse. VeXact's bitwise-aligned rollout keeps the KL at exactly zero throughout, yielding stable training and a ~2× higher final AIME 2024 score.
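The rollout-correction KL tracked in these runs is the standard k3 estimator, evaluated on tokens sampled from the rollout policy. A per-token sketch (the function name and calling convention are ours):

```python
import math

def k3_kl(rollout_logprob, actor_logprob):
    """K3 estimator of KL(rollout || actor) for one sampled token:
    k3 = r - 1 - log r, with r = exp(actor_logprob - rollout_logprob).

    Non-negative for any r > 0, and exactly 0.0 when the two
    logprobs match bitwise.
    """
    log_r = actor_logprob - rollout_logprob
    return math.exp(log_r) - 1.0 - log_r
```

With a bitwise-aligned rollout, `log_r` is exactly zero for every token, so the estimator is identically zero rather than merely small, which is why the KL curve stays flat instead of drifting upward.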

(Figure: four panels — training reward; AIME 2024 score (mean@32); rollout-correction K3 KL, log scale; gradient norm, log scale.)

Example Recipes

End-to-end RL training scripts live under examples/. Run any script from the repo root:

bash examples/getting_started/run_qwen3_1b7.sh
# override paths via env vars
model_dir=/path/to/model data_dir=/path/to/data bash examples/moe/run_qwen3_30B_A3B_dapo.sh
| Recipe | Model | Dataset | Hardware | Algorithm |
|---|---|---|---|---|
| getting_started/run_qwen3_1b7.sh | Qwen3-1.7B | gsm8k | 1×8H100 | GRPO |
| moe/run_qwen3_30B_A3B_dapo.sh | Qwen3-30B-A3B | DAPO-Math-17k / AIME 2025 | 1×8H100 | DAPO |
| moe/run_qwen3_30B_A3B_reinforce.sh | Qwen3-30B-A3B-Base | DAPO-Math-17k / AIME 2024 | 8×8H100 | REINFORCE |
| moe/run_qwen3_30B_A3B_16H100.sh | Qwen3-30B-A3B | gsm8k | 2×8H100 | GRPO |
| moe/run_qwen3_30B_A3B_8B200.sh | Qwen3-30B-A3B | gsm8k | 1×8B200 | GRPO |
| moe/run_moonlight_gsm8k.sh | Moonlight-16B-A3B-Instruct | gsm8k | 1×8B200 | GRPO |
| moe/run_moonlight_reinforce.sh | Moonlight-16B-A3B-Instruct | DAPO-Math-17k / AIME 2024 | 1×8B200 | REINFORCE |
| verify/run_dense_vexact.sh | DeepSeek-R1-Distill-Qwen-1.5B | MATH / AIME 2024+2025 | 1×8H100 | GRPO (vexact) |
| verify/run_dense_vllm.sh | DeepSeek-R1-Distill-Qwen-1.5B | MATH / AIME 2024+2025 | 1×8H100 | GRPO (vllm) |

See examples/README.md for path configuration, attention backend selection, and an explanation of the verify/ pair.

Installation

VeXact uses uv for environment management. Pick the extras that match your use case:

# End-to-end RL training (verl trainer + VeOmni FSDP actor + VeXact rollout):
uv sync --extra gpu --extra verl --extra veomni

# Rollout-only (no trainer, no FSDP actor):
uv sync --extra gpu

# Add the dev extra (pytest, pre-commit) when contributing:
uv sync --extra gpu --extra verl --extra veomni --extra dev

What each extra does:

  • gpu — PyTorch (CUDA 12.9), FlashAttention 2/3/4, quack-kernels, NVML.
  • verl — pulls verl from verl-project/verl (pinned by commit in [tool.uv.sources]) plus FastAPI/uvicorn/cachetools used by the trainer.
  • veomni — pulls VeOmni from ByteDance-Seed/VeOmni (pinned by commit).
  • vllm — vLLM 0.18 if you prefer it as the rollout engine instead of VeXact's native one.
  • dev — pytest, pytest-asyncio, pre-commit for development.

Working on verl or VeOmni locally

verl and veomni are pinned by git commit in pyproject.toml's [tool.uv.sources] block, so contributors and CI all resolve to the same upstream. To develop against a local checkout of either project, swap the relevant entry to editable = true (the file has inline hints):

[tool.uv.sources]
verl = { path = "./verl", editable = true }
veomni = { path = "./VeOmni", editable = true }

Then uv sync --extra gpu --extra verl --extra veomni re-resolves the venv to your local tree.

Components

Contribution Guide

See the contributing guide.

Acknowledgements

Besides VeRL and VeOmni, VeXact builds on and is inspired by the following projects:

  • vLLM — we refer to vLLM's model runner-v2 design and reuse its sampler.
  • batch_invariant_ops — batch-invariant operators for deterministic inference.
  • Torch Memory Saver — model parameter and KV cache offloading.
  • FlashAttention — we support FA4 on SM90+ GPUs (including SM100), including the MLA shape for the DeepSeek-V3 model architecture.

Citation

If you find our work useful, please consider citing our paper:

@article{zhong2026diagnosing,
  title={Diagnosing Training-Inference Mismatch in LLM Reinforcement Learning},
  author={Zhong, Tianle and Ling, Neiwen and Pi, Yifan and Wei, Zijun and Yu, Tianshu and Fox, Geoffrey and Wu, Peng and Yu, Xiao},
  journal={arXiv preprint arXiv:2605.14220},
  year={2026}
}
