SimpleVLA-RL is an efficient RL framework for VLA models that improves long-horizon planning under data scarcity. With reinforcement learning it substantially outperforms SFT on both simulation and real-world tasks, reveals a "pushcut" phenomenon in which new action patterns emerge during RL, and strengthens spatial, object, and goal generalization.
- [2025-10-01] SimpleVLA-RL now supports RoboTwin2.0 Benchmark. Feel free to experiment with it!
 - [2025-09-12] Excited to release the SimpleVLA-RL paper! Check it out: Paper.
 - [2025-05-27] We release the code of SimpleVLA-RL.
 
- End-to-end VLA RL pipeline built on veRL with VLA-specific optimizations
 - Multi-environment parallel rendering significantly accelerates VLA trajectory sampling
 - Leverages veRL's state-of-the-art infrastructure: efficient distributed training (FSDP), hybrid communication patterns, and optimized memory management for fast training/inference
 
- VLA Models: OpenVLA, OpenVLA-OFT
 - Benchmarks: LIBERO, RoboTwin 1.0/2.0
 - Modular architecture for easy integration of new VLA models, benchmarks and RL algorithms (Upcoming)
 
- Binary (0/1) outcome rewards - no complex reward design needed
 - Exploration strategies: dynamic sampling, adaptive clipping, and temperature tuning (see the sketch below)
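
To make these two ideas concrete, here is a minimal sketch of dynamic sampling over binary outcome rewards; it is not the repository's implementation, and the function and variable names are illustrative. Rollout groups whose 0/1 rewards are all identical carry no group-relative learning signal, so they are filtered out before the policy update.

```python
# Minimal sketch: filter rollout groups by their binary outcome rewards.
# Groups where every rollout succeeded (all 1) or failed (all 0) have zero
# advantage variance and contribute nothing to a group-relative update.
from typing import List

def keep_informative_groups(reward_groups: List[List[int]]) -> List[List[int]]:
    """Keep only rollout groups with mixed 0/1 outcomes."""
    return [g for g in reward_groups if 0 < sum(g) < len(g)]

# Example: 3 tasks x 4 rollouts each; each reward is a binary task-success outcome.
groups = [[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
print(keep_informative_groups(groups))  # -> [[1, 0, 1, 1]]
```

In practice, filtered groups are typically replaced by freshly sampled rollouts so the batch stays full and gradients remain informative even when success rates approach 0 or 1.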
 
SimpleVLA-RL extends veRL with VLA-specific components across the following modules:
- Main entry point with Ray initialization and `RobRewardManager` for reward distribution
- `verl/trainer/ppo/ray_trainer.py` - Main RL training loop (data loading, VLA rollout, model updates, evaluation, checkpointing) and RL algorithm-specific advantage computation
- `fsdp_workers.py` - Source of the core functions called in `ray_trainer.py`: VLA model/optimizer initialization, `generate_sequences`, `compute_entropy`, `update_actor`
- Specific implementation of the functions in `fsdp_workers.py`: RL loss computation, policy updates, `compute_log_prob`, `compute_entropy`
- `verl/workers/rollout/rob_rollout.py` - VLA rollout implementation: environment creation, multi-environment parallel rendering, VLA action generation, environment interaction, video saving, and trajectory and 0/1 reward collection (sketched below)
- `verl/utils/dataset/rob_dataset.py` - Dataset construction for training/testing across benchmarks
- VLA model implementations (OpenVLA-OFT/OpenVLA from the official code)
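
To make the rollout flow concrete, the sketch below shows the general pattern behind `rob_rollout.py`: step a batch of simulator environments in parallel, generate batched VLA actions, and collect each episode's trajectory together with its binary task-success reward. This is a conceptual illustration only; the `reset`/`step`/`predict_actions` interfaces are hypothetical stand-ins, not the actual SimpleVLA-RL API.

```python
# Conceptual sketch of a parallel VLA rollout with 0/1 outcome rewards.
# All interfaces here are illustrative placeholders, not the SimpleVLA-RL API.

def rollout(policy, envs, max_steps=300):
    observations = [env.reset() for env in envs]        # parallel environment rendering
    done = [False] * len(envs)
    trajectories = [[] for _ in envs]
    rewards = [0.0] * len(envs)                          # one binary outcome reward per episode

    for _ in range(max_steps):
        actions = policy.predict_actions(observations)   # batched VLA action generation
        for i, env in enumerate(envs):
            if done[i]:
                continue
            next_obs, success, done[i] = env.step(actions[i])   # environment interaction
            trajectories[i].append((observations[i], actions[i]))
            observations[i] = next_obs
            if done[i]:
                rewards[i] = 1.0 if success else 0.0     # 0/1 reward collection
        if all(done):
            break
    return trajectories, rewards
```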
 
See SETUP.md for detailed instructions on setting up the conda environment.
An SFT (Supervised Fine-Tuning) VLA model is required for RL training. Below are the available options:
- OpenVLA-OFT SFT Models: Download from the SimpleVLA-RL Collection (for example, via `huggingface_hub`, as sketched below). Available models include:
  - libero-10 traj1/trajall SFT
  - libero-goal/object/spatial traj1 SFT
  - RoboTwin2.0 tasks traj1000 SFT
- OpenVLA SFT Models: Download from here.
- Other Models: For other models, you may need to fine-tune them yourself.
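
If the SFT checkpoints are hosted on the Hugging Face Hub (as the collection above suggests), one way to fetch a model is with `huggingface_hub`. The repo id below is a placeholder, so substitute the actual model name from the collection:

```python
# Download an SFT checkpoint from the Hugging Face Hub.
# "<org>/<model-name>" is a placeholder; use the actual repo id from the collection.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<org>/<model-name>",
    local_dir="./sft_models/openvla-oft-sft",
)
print(f"SFT model downloaded to {local_path}")
```

Point `SFT_MODEL_PATH` in the training script (see the next section) to the downloaded directory.
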
Before running the training script, ensure the following configurations are properly set:
- Set Your Weights and Biases (WandB) API Key: Replace the `WANDB_API_KEY` field in `SimpleVLA-RL/align.json` with your own WandB API key (see the snippet after this list).
- Modify Key Variables: Update the following variables in `examples/run_openvla_oft_rl_libero/twin2.sh` as needed:
  - `WANDB_API_KEY`: Your WandB API key.
  - `EXPERIMENT_NAME`: The name of your experiment. You can choose any name.
  - `SFT_MODEL_PATH`: Path to your SFT model.
  - `CKPT_PATH`: Path where your checkpoints will be saved.
  - `DATASET_NAME`: For detailed options, refer to `examples/run_openvla_oft_rl_libero/twin2.sh`.
  - `ALIGN_PATH`: Path to the `SimpleVLA-RL/align.json` file.
  - `NUM_GPUS`: Number of GPUs available per node (e.g., 8).
  - `NUM_NODES`: Number of nodes used for RL training (e.g., 1).
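
If you prefer not to edit `align.json` by hand, a short script like the one below can set the key. It assumes `align.json` is a flat JSON object containing a `WANDB_API_KEY` field; adapt it if the file is structured differently.

```python
# Write your WandB API key into align.json (assumes a top-level WANDB_API_KEY field).
import json

path = "SimpleVLA-RL/align.json"
with open(path) as f:
    cfg = json.load(f)

cfg["WANDB_API_KEY"] = "your-wandb-api-key"  # replace with your actual key

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```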
 
Note
- The script has been tested on the following configurations:
  - Single-node setup: `NUM_NODES=1`, `NUM_GPUS=8` (1 node with 8 NVIDIA A800 GPUs, each with 80 GB of memory).
  - Multi-node setup: `NUM_NODES=2`, `NUM_GPUS=8` (2 nodes with 16 NVIDIA A800 GPUs, each with 80 GB of memory).
- The tested NVIDIA driver version is `470.161.03` and the CUDA version is `12.4`; matching these versions exactly is not necessary.
- Run RL Training: Start RL training for OpenVLA-OFT with `bash examples/run_openvla_oft_rl_libero.sh` for the LIBERO benchmark or `bash examples/run_openvla_oft_rl_twin2.sh` for the RoboTwin2.0 benchmark.
 
To evaluate the performance of your model, enable evaluation mode by setting `trainer.val_only=True` in `examples/run_openvla_oft_rl_libero/twin2.sh`, then execute the same script: `bash examples/run_openvla_oft_rl_libero.sh` or `bash examples/run_openvla_oft_rl_twin2.sh`.

We evaluate SimpleVLA-RL on LIBERO using OpenVLA-OFT. SimpleVLA-RL improves the performance of OpenVLA-OFT to 97.6 points on LIBERO-Long, setting a new state of the art. Remarkably, using only one trajectory per task for cold-start SFT, SimpleVLA-RL raises the performance of OpenVLA-OFT from 17.3 to 91.7, an improvement of 74.4 points (430.1% relative).
We developed this preview version of the code based on veRL, OpenVLA-OFT, RoboTwin2.0, and PRIME. We acknowledge their significant contributions! For further details and updates, please refer to the official documentation and repositories of the respective projects.
- Support advanced diffusion-based RL: pi0 and pi0.5 with flow-matching RL
 - Support more VLA models, especially lightweight ones (e.g., VLA-Adapter, SmolVLA)
 
- Support more benchmarks: e.g. SimplerEnv, BEHAVIOR, Calvin
 - Support real-world RL.
 
- Additional online RL methods and offline RL algorithms
 - Modular environment and VLA interface for easy adaptation
 - Further optimize the RL framework to achieve more efficient training
 
- Haozhan Li: [email protected]
 - Ning Ding: [email protected]
 
If you find SimpleVLA-RL helpful, please cite us:
@article{li2025simplevla,
  title={SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning},
  author={Li, Haozhan and Zuo, Yuxin and Yu, Jiale and Zhang, Yuhao and Yang, Zhaohui and Zhang, Kaiyan and Zhu, Xuekai and Zhang, Yuchen and Chen, Tianxing and Cui, Ganqu and others},
  journal={arXiv preprint arXiv:2509.09674},
  year={2025}
}

