RAPID: Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning

This repository provides training, inference, and evaluation instructions for RAPID.

🔧 Installation

# Create conda environment
conda create -n rapid python=3.11
conda activate rapid

# Install dependencies
cd verl-main
pip install -r requirements.txt

# Install flash-attention (see: https://github.com/Dao-AILab/flash-attention/releases)
# Follow the instructions for your CUDA version

# Install verl-main
pip install -e .

# Install VLMEvalKit
cd ../VLMEvalKit
pip install -e .

pip install transformers==4.51.1
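
Before moving on, it can help to confirm what actually ended up in the environment, since the flash-attention wheel has to match the CUDA version of the installed torch build. The following is a minimal sketch, not part of the repository: the helper names (pkg_version, torch_cuda, report_env) and the PYTHON override variable are made up for illustration, and it assumes the distribution names torch, transformers, and flash-attn.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of RAPID) for checking the environment.

# Print the installed version of a Python distribution, or "missing".
# importlib.metadata is in the standard library from Python 3.8 onward.
pkg_version() {
  "${PYTHON:-python3}" -c '
import sys
from importlib.metadata import version, PackageNotFoundError
try:
    print(version(sys.argv[1]))
except PackageNotFoundError:
    print("missing")
' "$1"
}

# CUDA version the installed torch build targets. Prints "None" for a
# CPU-only build and "unavailable" if torch cannot be imported.
torch_cuda() {
  "${PYTHON:-python3}" -c 'import torch; print(torch.version.cuda)' 2>/dev/null \
    || echo "unavailable"
}

# One line per package, followed by the torch CUDA version.
report_env() {
  local pkg
  for pkg in torch transformers flash-attn; do
    printf '%-14s %s\n' "$pkg" "$(pkg_version "$pkg")"
  done
  printf '%-14s %s\n' "torch CUDA" "$(torch_cuda)"
}

report_env
```

If transformers does not report 4.51.1, rerunning the last pip command above should restore the pin.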

📦 Prepare Training Data

Download the ViRL39K dataset and preprocess it:

python verl-main/examples/data_preprocess/virl39k_pre.py \
  --src-parquet /cache/data/datasets/ViRL39K/39Krelease.parquet \
  --tgt-dir /cache/data/huggingface_datasets/virl39k_hf_no_deepscaler 

python verl-main/examples/data_preprocess/virl39k.py \
  --src-hf-dataset /cache/data/huggingface_datasets/virl39k_hf_no_deepscaler/ \
  --tgt-parquet /cache/data/huggingface_datasets/virl39k_no_deepscaler_caption.parquet

python verl-main/examples/data_preprocess/virl39k_qa.py \
  --src-hf-dataset /cache/data/huggingface_datasets/virl39k_hf_no_deepscaler/ \
  --tgt-parquet /cache/data/huggingface_datasets/virl39k_no_deepscaler_qa.parquet
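
The three commands above write one HuggingFace dataset directory and two parquet files. A quick existence check before training can catch a step that failed silently. This is a sketch only: missing_outputs and the DATA_ROOT variable are hypothetical names, and the default root simply mirrors the paths used in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RAPID): report which expected paths are absent.
# Returns non-zero if at least one path is missing.
missing_outputs() {
  local status=0 path
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "missing: $path" >&2
      status=1
    fi
  done
  return "$status"
}

# DATA_ROOT is an assumption; adjust it if you changed --tgt-dir / --tgt-parquet.
DATA_ROOT="${DATA_ROOT:-/cache/data/huggingface_datasets}"

if missing_outputs \
    "$DATA_ROOT/virl39k_hf_no_deepscaler" \
    "$DATA_ROOT/virl39k_no_deepscaler_caption.parquet" \
    "$DATA_ROOT/virl39k_no_deepscaler_qa.parquet"; then
  echo "all preprocessing outputs present"
else
  echo "rerun the preprocessing step that writes the paths listed above" >&2
fi
```

The check only tests that the paths exist; it does not validate the contents of the parquet files.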

🏋️‍♂️ Training

Train Qwen2.5-VL-7B with GRPO

bash verl-main/examples/grpo_trainer/grpo_7b.sh

Train the resulting checkpoint with VPO (using R1-Distilled-7B as the reasoner)

bash verl-main/examples/grpo_trainer/vpo_7b.sh

Convert Checkpoints to HuggingFace Format

bash verl-main/scripts/convert2hf.sh
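
The three training steps run in a fixed order: GRPO, then VPO, then conversion. One way to chain them so that a failure stops the sequence is sketched below. The wrapper functions run_stage and run_pipeline and the DRY_RUN variable are hypothetical, not part of the repository. The sketch also assumes each script is already configured to read the previous stage's output, which this README does not specify, so check the paths inside the scripts first.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of RAPID) around the three stage scripts.

# Run one stage script. With DRY_RUN=1 the command is printed, not executed.
run_stage() {
  local script="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: bash $script"
    return 0
  fi
  if [ ! -f "$script" ]; then
    echo "stage script not found: $script" >&2
    return 1
  fi
  echo "running: $script"
  bash "$script"
}

# Stages in the order given in this README; && stops at the first failure.
run_pipeline() {
  run_stage verl-main/examples/grpo_trainer/grpo_7b.sh &&
  run_stage verl-main/examples/grpo_trainer/vpo_7b.sh &&
  run_stage verl-main/scripts/convert2hf.sh
}

# Preview the order without launching any training.
DRY_RUN=1 run_pipeline
```

To launch for real, call run_pipeline without DRY_RUN from the repository root.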

🔍 Inference and Evaluation

bash VLMEvalKit/run_inference.sh

📁 Directory Structure

.
├── verl-main/
│   ├── examples/
│   ├── scripts/
│   └── ...
└── VLMEvalKit/
    ├── outputs/
    └── ...
