SimpleVLA-RL: Open RL Framework for Vision–Language–Action Models

Paper · GitHub · Hugging Face Collection · Twitter · WeChat

SimpleVLA-RL is an efficient RL framework for VLA models that improves long-horizon planning under data scarcity. Its reinforcement learning recipe substantially outperforms SFT on both simulation and real-world tasks, reveals a "pushcut" new-action phenomenon, and strengthens spatial, object, and goal generalization.

🎉News

  • [2025-10-01] SimpleVLA-RL now supports RoboTwin2.0 Benchmark. Feel free to experiment with it!
  • [2025-09-12] Excited to release the SimpleVLA-RL paper! Check it out: Paper.
  • [2025-05-27] We release the code of SimpleVLA-RL.

📌Highlights

Efficient and Effective VLA Reinforcement Learning Framework

  • End-to-end VLA RL pipeline built on veRL with VLA-specific optimizations
  • Multi-environment parallel rendering significantly accelerates VLA trajectory sampling
  • Leverages veRL's state-of-the-art infrastructure: efficient distributed training (FSDP), hybrid communication patterns, and optimized memory management for fast training/inference

Model and Environment Support

  • VLA models: OpenVLA-OFT and OpenVLA
  • Benchmarks: LIBERO and RoboTwin2.0

Minimal Reward Engineering and Exploration Strategies

  • Binary (0/1) outcome rewards - no complex reward design needed
  • Exploration strategies: dynamic sampling, adaptive clipping, temperature tuning
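
To make the dynamic-sampling idea concrete, here is a minimal, hypothetical sketch (function name and signature are ours, not the repository's): with binary outcome rewards and a group-relative baseline, prompt groups whose rollouts all fail or all succeed carry no learning signal and can be filtered out before the policy update.

    # Illustrative sketch of dynamic sampling for 0/1 outcome rewards.
    # Groups whose rollouts are all failures or all successes are dropped,
    # since they contribute zero advantage under a group-relative baseline.
    import torch

    def keep_informative_groups(rewards: torch.Tensor, group_size: int) -> torch.Tensor:
        """rewards: (num_prompts * group_size,) tensor of 0/1 outcomes.
        Returns a boolean mask over prompt groups; only mixed-outcome groups are kept."""
        grouped = rewards.view(-1, group_size)
        success_rate = grouped.mean(dim=-1)
        return (success_rate > 0.0) & (success_rate < 1.0)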

🔧Key Implementations

SimpleVLA-RL extends veRL with VLA-specific components across the following modules:

verl/trainer/main_ppo.py

  • Main entry point with ray initialization
  • RobRewardManager for reward distribution
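
As an illustration of what an outcome-based reward manager does, here is a simplified, hypothetical sketch (class name, batch keys, and shapes are assumptions, not the actual RobRewardManager interface): each trajectory carries a 0/1 success flag from the environment, and the reward is placed on its final generated token.

    import torch

    class BinaryOutcomeRewardManagerSketch:
        """Assigns a sparse 0/1 task-success reward at the end of each trajectory."""

        def __call__(self, batch):
            # batch["success"]: (num_traj,) float tensor of 0./1. task outcomes
            # batch["response_mask"]: (num_traj, seq_len) 0/1 mask of generated tokens
            success = batch["success"]
            response_mask = batch["response_mask"]

            rewards = torch.zeros_like(response_mask, dtype=torch.float32)
            last_idx = response_mask.sum(dim=-1).long() - 1   # last generated step
            rows = torch.arange(rewards.shape[0])
            rewards[rows, last_idx] = success                  # sparse terminal reward
            return rewards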

verl/trainer/ppo/ray_trainer.py

  • Main RL training loop: data loading, VLA rollout, model updates, evaluation, checkpointing
  • RL algorithm-specific advantage computation
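
The exact advantage estimator is configured by the training script; purely as an illustrative example (not necessarily the repository's default), a group-relative estimator for binary outcome rewards looks like this: several rollouts are sampled per task, and each trajectory's advantage is its reward normalized against its group's statistics.

    import torch

    def group_relative_advantage(rewards: torch.Tensor, group_size: int,
                                 eps: float = 1e-6) -> torch.Tensor:
        """rewards: (num_prompts * group_size,) scalar 0/1 outcome rewards."""
        grouped = rewards.view(-1, group_size)
        mean = grouped.mean(dim=-1, keepdim=True)
        std = grouped.std(dim=-1, keepdim=True)
        return ((grouped - mean) / (std + eps)).view(-1)

    # Example: 2 tasks, 4 rollouts each
    rewards = torch.tensor([1., 0., 1., 1., 0., 0., 1., 0.])
    print(group_relative_advantage(rewards, group_size=4))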

verl/workers/fsdp_workers.py

  • Source of core functions called in ray_trainer.py
  • VLA model/optimizer initialization, generate_sequences, compute_entropy, update_actor
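
Roughly, the worker exposes a small method surface that ray_trainer.py drives each iteration; the method names below follow this README, but the class name, signatures, and docstrings are our own sketch rather than the actual worker API.

    class ActorRolloutWorkerSketch:
        """Hypothetical outline of the worker surface used by the training loop."""

        def init_model(self):
            """Load the SFT VLA checkpoint, wrap it with FSDP, and build the optimizer."""

        def generate_sequences(self, prompts):
            """Roll out the VLA in parallel environments and return trajectories."""

        def compute_entropy(self, batch):
            """Return per-token policy entropy, e.g. for logging and exploration monitoring."""

        def update_actor(self, batch):
            """Compute the RL loss (see dp_rob.py) and apply an optimizer step."""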

verl/workers/actor/dp_rob.py

  • Specific implementation of functions in fsdp_workers.py
  • RL loss computation, policy updates, compute_log_prob, compute_entropy
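
For intuition, a minimal clipped policy-gradient loss of this kind is sketched below; the asymmetric clip range mirrors the adaptive-clipping idea listed under Highlights, but the function name, default values, and masking convention are assumptions, not the repository's exact implementation.

    import torch

    def clipped_policy_loss(log_prob, old_log_prob, advantages, mask,
                            clip_low: float = 0.2, clip_high: float = 0.28):
        """All inputs are per-token tensors; mask is 1 for valid generated tokens."""
        ratio = torch.exp(log_prob - old_log_prob)                 # importance ratio
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
        loss = -torch.min(unclipped, clipped)                      # pessimistic objective
        return (loss * mask).sum() / mask.sum()                    # mean over valid tokens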

verl/workers/rollout/rob_rollout.py

  • VLA rollout implementation: environment creation, multi-environment parallel rendering, VLA action generation, environment interaction, video saving, trajectory and 0/1 reward collection
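
Schematically, the rollout steps many environments in lock-step so the VLA can predict actions for the whole batch at once, and each episode's outcome is recorded as a 0/1 reward; the environment and policy interfaces below are placeholders rather than the repository's actual APIs.

    import numpy as np

    def parallel_rollout(envs, policy, max_steps: int = 300):
        """envs: list of environment handles; policy: callable mapping a list of
        observations to a list of actions (one batched VLA forward pass)."""
        obs = [env.reset() for env in envs]
        done = np.zeros(len(envs), dtype=bool)
        success = np.zeros(len(envs), dtype=np.float32)

        for _ in range(max_steps):
            actions = policy(obs)                           # batched action prediction
            for i, env in enumerate(envs):
                if done[i]:
                    continue
                obs[i], reward, terminated, info = env.step(actions[i])
                if terminated:
                    done[i] = True
                    success[i] = float(info.get("task_success", reward > 0))
            if done.all():
                break
        return success                                       # 0/1 outcome reward per env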

verl/utils/dataset/rob_dataset.py

  • Dataset construction for training/testing across benchmarks

verl/utils/vla_utils/

  • VLA model implementations (OpenVLA-OFT/OpenVLA from official code)

✨Getting Started

1. Set Up the Environment

See SETUP.md for detailed instructions on setting up the conda environment.

2. Prepare the SFT Model

An SFT (Supervised Fine-Tuning) VLA model is required for RL training. Below are the available options:

  • OpenVLA-OFT SFT Models
    Download from the SimpleVLA-RL Collection. Available models include:

    • libero-10 traj1/trajall SFT
    • libero-goal/object/spatial traj1 SFT
    • RoboTwin2.0 tasks traj1000 SFT
  • OpenVLA SFT Models
    Download from here.

  • Other Models
    For other models, you may need to fine-tune them yourself.

3. Train with SimpleVLA-RL

Before running the training script, ensure the following configurations are properly set:

  • Set Your Weights and Biases (WandB) API Key
    Replace the WANDB_API_KEY field in SimpleVLA-RL/align.json with your own WandB API key.

  • Modify Key Variables
    Update the following variables in examples/run_openvla_oft_rl_libero.sh or examples/run_openvla_oft_rl_twin2.sh as needed:

    • WANDB_API_KEY: Your WandB API key.
    • EXPERIMENT_NAME: The name of your experiment. You can choose any name.
    • SFT_MODEL_PATH: Path to your SFT model.
    • CKPT_PATH: Path where your checkpoints will be saved.
    • DATASET_NAME: The dataset/benchmark to train on. For the available options, refer to examples/run_openvla_oft_rl_libero.sh or examples/run_openvla_oft_rl_twin2.sh.
    • ALIGN_PATH: Path to the SimpleVLA-RL/align.json file.
    • NUM_GPUS: Number of GPUs available per node (e.g., 8).
    • NUM_NODES: Number of nodes used for RL training (e.g., 1).

Note

  • The script has been tested on the following configurations:
    • Single-node setup: NUM_NODES=1, NUM_GPUS=8 (1 node with 8 NVIDIA A800 80GB GPUs).
    • Multi-node setup: NUM_NODES=2, NUM_GPUS=8 (2 nodes with 16 NVIDIA A800 80GB GPUs in total).
  • The tested NVIDIA driver version is 470.161.03 with CUDA 12.4, though matching these versions exactly is not required.

  • Run RL Training
    Use one of the following commands to start RL training for OpenVLA-OFT on the LIBERO or RoboTwin2.0 benchmark:

    bash examples/run_openvla_oft_rl_libero.sh
    or
    bash examples/run_openvla_oft_rl_twin2.sh

4. Run Evaluation

To evaluate the performance of your model, enable evaluation mode by setting trainer.val_only=True in examples/run_openvla_oft_rl_libero.sh or examples/run_openvla_oft_rl_twin2.sh, then execute the same script:

bash examples/run_openvla_oft_rl_libero.sh
or
bash examples/run_openvla_oft_rl_twin2.sh

📃 Main Results

We evaluate SimpleVLA-RL on LIBERO using OpenVLA-OFT. SimpleVLA-RL improves OpenVLA-OFT to 97.6 points on LIBERO-Long, setting a new state of the art. Remarkably, using only one trajectory per task for cold-start SFT, SimpleVLA-RL raises OpenVLA-OFT's score from 17.3 to 91.7, an improvement of 74.4 points (430.1%).

Figure: Main results of SimpleVLA-RL.
Figure: Overview of SimpleVLA-RL.

🌻Acknowledgement

We developed this preview version of the code based on veRL, OpenVLA-OFT, RoboTwin2.0, and PRIME, and we gratefully acknowledge their significant contributions. For further details and updates, please refer to the official documentation and repositories of the respective projects.

📝Roadmap

Expanding Model Support

  • Support advanced diffusion-based RL: pi0 and pi0.5 with flow-matching RL
  • Support more VLA models, especially lightweight ones (e.g., VLA-Adapter, SmolVLA)

Expanding Environment Support

Expanding Framework

  • Additional online and offline RL algorithms
  • Modular environment and VLA interface for easy adaptation
  • Further optimize the RL framework to achieve more efficient training

📨Contact

🎈Citation

If you find SimpleVLA-RL helpful, please cite us:

@article{li2025simplevla,
  title={SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning},
  author={Li, Haozhan and Zuo, Yuxin and Yu, Jiale and Zhang, Yuhao and Yang, Zhaohui and Zhang, Kaiyan and Zhu, Xuekai and Zhang, Yuchen and Chen, Tianxing and Cui, Ganqu and others},
  journal={arXiv preprint arXiv:2509.09674},
  year={2025}
}

🌟Star History
