VTS-V: Multi-step Visual Reasoning with Visual Tokens Scaling and Verification

This repository is the official implementation of Multi-step Visual Reasoning with Visual Tokens Scaling and Verification.

Our paper's homepage is at: https://vts-v.github.io/

Introduction

[Overview figure: main]

Requirements

To install requirements:

conda create -n vts_v python=3.10
conda activate vts_v
pip install -r requirements.txt

Quick Start

Step-1: Launch model deployment

You can run our method, VTS-V, with either closed-source APIs or open-source models.

Closed-source APIs

For closed-source APIs, the service must be compatible with OpenAI's chat/completions endpoint.

In our experiments, we used GPT-4o to perform reasoning tasks.

If you intend to use GPT-4o as the reasoning model, you need to set the base URL and API key in the test script to your own values.

## Set OpenAI API base-url and api-key here
export OPENAI_BASE_URL=https://api.openai.com/v1 # default value
export OPENAI_API_KEY=your-openai-api-key-here
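
Before launching a full run, you can optionally confirm that the endpoint and key are valid. The following curl call is a quick sanity check of the OpenAI-compatible chat/completions endpoint (a hedged sketch, not part of the repository's scripts):

# Optional sanity check: should return a short JSON chat completion
curl "$OPENAI_BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}'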

Open-source Models

Open-source models need to be served behind an OpenAI-compatible API using vLLM or SGLang.

In our experiments, we used the Qwen2.5VL-7B-Instruct, Qwen2VL-7B-Instruct, and LLaMA-3.2-11B-Vision-Instruct models, along with our corresponding fine-tuned versions. The Qwen2-VL and Qwen2.5-VL series models are deployed with vLLM, while the LLaMA-3.2-Vision series models are deployed with SGLang.

For vLLM, you can see qwen2vl_28080.sh and qwen25vl_28080.sh as examples.

For SGLang, you can see llama32vision.sh as an example.

In the qwen2vl_28080.sh, qwen25vl_28080.sh, and llama32vision.sh scripts, set MODEL_PATH to your own model path or the corresponding HuggingFace model ID, and set MODEL_PORT to your chosen port number. Note that the MODEL_PATH and MODEL_PORT values you set here must match the values used in the subsequent test scripts.

# set your own model path
MODEL_PATH=your_model_path_or_name_here
# set your own port
MODEL_PORT=28080
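
For reference, these scripts essentially wrap launch commands along the following lines (a hedged sketch of typical vLLM and SGLang invocations; the actual scripts in this repository may pass additional flags, e.g. for tensor parallelism or multimodal input limits):

# vLLM: OpenAI-compatible server for the Qwen2-VL / Qwen2.5-VL models
vllm serve "$MODEL_PATH" --port "$MODEL_PORT"

# SGLang: OpenAI-compatible server for the LLaMA-3.2-Vision models
python -m sglang.launch_server --model-path "$MODEL_PATH" --port "$MODEL_PORT"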

Step-2: Run VTS-V inference and evaluation

Supported Benchmarks: We currently support the BLINK, MathVista, MMStar, and Vstar benchmarks.

Evaluation Scripts: The scripts to run our method are located in the eval/BLINK/scripts, eval/MathVista/scripts, eval/MMStar/scripts, and eval/Vstar/scripts folders.

Method Modes: Our method operates in three modes: Direct, VTS, and VTS-V.

  • Direct refers to directly using the reasoning model to test the benchmark.
  • VTS refers to employing Multi-step Visual Reasoning with Visual Tokens Scaling but without the Verifier.
  • VTS-V refers to using the complete Multi-step Visual Reasoning with Visual Tokens Scaling and Verification.

In these scripts, different operation modes can be configured by setting corresponding command-line arguments. The specific modes and parameter settings are detailed in the following table:

Mode      using_vts   using_verifier
Direct    False       False
VTS       True        False
VTS-V     True        True
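
For example, to run in VTS mode the arguments would be set as follows inside an evaluation script (a hedged sketch: the parameter names come from the table above, but check the script you are editing for the exact argument syntax):

# VTS mode: visual token scaling enabled, verifier disabled
using_vts=True
using_verifier=False
# Direct mode sets both to False; VTS-V mode sets both to True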

LLM-as-a-judge: When evaluating the benchmarks, for cases where the correctness of a response cannot be determined programmatically, we employ the LLM-as-a-judge approach. By default, we use Qwen-Max as the judge model. You need to set the Qwen-Max base URL, API key, and model name in the script via the following variables:

# default 
export DASHSCOPE_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export DASHSCOPE_API_KEY=your-dashscope-api-key-here
export DASHSCOPE_MODEL="qwen-max"

Alternatively, you can use any other API that supports the OpenAI chat completion format. You may configure your own API base URL, API key, and model name, but do not rename these variables, as doing so will cause errors. For example, if you wish to use GPT-3.5-turbo as the evaluation model, you can set the following parameters:

export DASHSCOPE_BASE_URL=your-base-url-here
export DASHSCOPE_API_KEY=your-api-key-here
export DASHSCOPE_MODEL="gpt-3.5-turbo"

Reasoner Model Type: In our method, we support both closed-source model APIs and open-source models for reasoning tasks. To proceed:

  • First, deploy the model following the instructions in Step-1: Launch model deployment
  • Then specify the reasoning model type via the reasoner_type parameter in the test script, as sketched below:
    • For all closed-source model APIs, set reasoner_type="gpt-4o" (regardless of the actual API used)
    • For all open-source models, set reasoner_type="qwen-vl" (regardless of the actual model deployed)
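
For instance, the relevant line in a test script would look like the following (a hedged sketch; only the reasoner_type values themselves come from this README):

# Closed-source API (e.g. GPT-4o or any other OpenAI-compatible API)
reasoner_type="gpt-4o"

# Open-source model served via vLLM or SGLang
# reasoner_type="qwen-vl"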

Next, we will use BLINK as an example to demonstrate how to run the three modes of our method with GPT-4o and Qwen2.5-VL. The procedure for the other benchmarks (Vstar, MMStar, MathVista) is nearly identical to BLINK's, and the procedure for the other models (Qwen2-VL and LLaMA-3.2-Vision) closely mirrors that of Qwen2.5-VL.

GPT-4o

You should configure your own GPT-4o API base URL and API key in the script.

## Set your own gpt-4o api base-url and api-key here
export OPENAI_BASE_URL=https://api.openai.com/v1 # default value
export OPENAI_API_KEY=your-openai-api-key-here

You also need to specify the Qwen-Max base URL, API key, and model name for the LLM-as-a-judge evaluation:

export DASHSCOPE_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export DASHSCOPE_API_KEY=your_dashscope_api_key
export DASHSCOPE_MODEL="qwen-max"

For Direct mode, you can run the script gpt_blink_direct.sh:

cd eval/BLINK/scripts
bash gpt_blink_direct.sh

For VTS mode, you can run the script gpt_blink_vts.sh:

cd eval/BLINK/scripts
bash gpt_blink_vts.sh

For VTS-V mode, you can run the script gpt_blink_vts_v.sh.

In VTS-V mode, you need to specify the verifier by setting the following variables in gpt_blink_vts_v.sh:

  • DPO_MODEL_PATH to the DPO-trained Qwen2.5-VL-7B-Instruct model used in our method
  • REF_MODEL_PATH to the original Qwen2.5-VL-7B-Instruct model

# set dpo model path and ref model path
DPO_MODEL_PATH=your_own_dpo_model_path_here
REF_MODEL_PATH=your_own_ref_model_path_here
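
As background on why both paths are needed (our reading of the configuration, not something the scripts state explicitly): a DPO-trained policy defines an implicit reward only relative to its reference model, so the verifier presumably scores a candidate reasoning step y given context x by comparing the two models' log-probabilities, in standard DPO notation:

r(x, y) = \beta \left[ \log \pi_{\mathrm{DPO}}(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right]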

Then run the following command in the terminal:

cd eval/BLINK/scripts
bash gpt_blink_vts_v.sh

Qwen2.5-VL-7B-Instruct

You can use either the original Qwen2.5-VL-7B-Instruct model or our fine-tuned version.

First, you need to launch the Qwen2.5-VL model using vLLM as described in Step-1: Launch model deployment.

Then, you must set the MODEL_PATH and MODEL_PORT parameters in the test scripts to exactly match your vLLM configuration; otherwise, you will get runtime errors.

# Set your own model path and port here
MODEL_PATH=your_model_path_here
MODEL_PORT=28080
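
Before running the evaluation, you can quickly confirm that the vLLM server is reachable on the configured port by querying its OpenAI-compatible model list endpoint (a hedged sketch, not part of the repository's scripts):

# Should return a JSON list containing the served model name
curl "http://localhost:${MODEL_PORT}/v1/models"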

For Direct mode, you can run the script qwen25vl_blink_direct.sh:

cd eval/BLINK/scripts
bash qwen25vl_blink_direct.sh

For VTS mode, you can run the script qwen25vl_blink_vts.sh:

cd eval/BLINK/scripts
bash qwen25vl_blink_vts.sh

For VTS-V mode, you can run the script qwen25vl_blink_vts_v.sh.

In VTS-V mode, you need to specify the verifier by setting the following variables in qwen25vl_blink_vts_v.sh:

  • DPO_MODEL_PATH to the DPO-trained Qwen2.5-VL-7B-Instruct model used in our method
  • REF_MODEL_PATH to the original Qwen2.5-VL-7B-Instruct model

# set dpo model path and ref model path
DPO_MODEL_PATH=your_own_dpo_model_path_here
REF_MODEL_PATH=your_own_ref_model_path_here

Then run the following command in the terminal:

cd eval/BLINK/scripts
bash qwen25vl_blink_vts_v.sh

Dataset Construction

In our work, we constructed a 315K SFT dataset and a 301K DPO dataset by sampling from LLaVA-OneVision-Data. For the detailed dataset construction methodology, please refer to datagen/README.md.

Training

In this work, we conducted supervised fine-tuning (SFT) using our custom-built VTS-SFT dataset on three models: Qwen2.5VL-7B-Instruct, Qwen2VL-7B-Instruct, and LLaMA-3.2-11B-Vision-Instruct. Additionally, we performed DPO training on Qwen2.5VL-7B-Instruct using our constructed VTS-DPO dataset. For the detailed methodology, please refer to train/README.md.

Citation

@article{bai2025vtsv,
    title={Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification},
    author={Bai, Tianyi and Hu, Zengjie and Sun, Fupeng and Qiu, Jiantao and Jiang, Yizhen and He, Guangxin and Zeng, Bohan and He, Conghui and Yuan, Binhang and Zhang, Wentao},
    journal={arXiv preprint arXiv:2506.07235},
    year={2025},
    url={https://arxiv.org/abs/2506.07235},
    archivePrefix={arXiv},
    eprint={2506.07235},
    primaryClass={cs.CV},
}
