Dairu Liu²*, Ziyue Wang¹* 🔰, Minyuan Ruan¹*, Fuwen Luo¹,
Chi Chen¹, Peng Li¹ ✉️, Yang Liu¹ ✉️
¹Dept. of Computer Science & Technology, Institute for AI, Tsinghua University, Beijing, China
²College of Software, Nankai University, Tianjin, China
*Equal contribution
🔰Project lead
✉️Corresponding author

- [TODO] More interesting cases are on the way
- [26-May-2025] Repo and arXiv paper released
Images usually convey richer detail than text, but they often include redundant information that can degrade multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple, concise abstracts. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstracts instead of explicit verbal thoughts or elaborate guidance, enabling a more concentrated visual reasoning process.
Explicit thinking approaches, such as Chain-of-Thought (CoT) or tool-augmented methods, increase the complexity of the reasoning process by inserting verbose intermediate steps, external knowledge, or additional visual information. In contrast, VAT reduces redundant visual information and encourages models to focus their reasoning on the most essential visual elements.
Experimental results show that VAT consistently empowers different models: by employing diverse types of visual abstracts, it achieves an average gain of 17% over the GPT-4o baseline, compared with 13% for CoT, demonstrating that VAT better enhances the visual reasoning abilities of MLLMs on conceptual, structural, and relational reasoning tasks. VAT is also compatible with CoT in knowledge-intensive multimodal reasoning tasks. These findings highlight the effectiveness of visual reasoning via abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.
We recommend using a conda environment:
```bash
conda create -n vat python=3.9
conda activate vat
pip install Pillow opencv-python numpy gradio gradio_client networkx scipy datasets
```
The `img2abs` directory contains two main modules for image abstraction: `informative-drawings` and `PhotoSketch`.

`informative-drawings` provides three abstraction styles:

- `open_server.py`: Open-sketch style (port 8086)
- `contour_server.py`: Contour style (port 8085)
- `ani_server.py`: Anime style (port 8084)
To start a service (example for `open_server.py`):

```bash
cd img2abs/informative-drawings
python open_server.py
```

Replace `open_server.py` with the corresponding `*_server.py` file for other styles.
`PhotoSketch` provides one style:

- `ps_server.py`: Photo-to-sketch style (port 8083)

To start:

```bash
cd img2abs/PhotoSketch
python ps_server.py
```
All services use Gradio and provide a web interface for image upload and abstraction.
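Besides the web interface, the services can also be called programmatically. Below is a minimal sketch using `gradio_client` (already listed in the pip dependencies); the port and `api_name` are assumptions taken from the list above, so check the corresponding `*_server.py` for the exact endpoint signature and return format.

```python
# Minimal sketch: call an abstraction service via gradio_client.
# The port (8086 -> open_server.py) and api_name="/predict" are assumptions
# based on the README; verify them against the actual *_server.py.
from gradio_client import Client, handle_file

client = Client("http://127.0.0.1:8086/")  # open_server.py (Open-sketch style)
result = client.predict(handle_file("path/to/image.jpg"), api_name="/predict")
print(result)  # typically the path to the generated visual abstract
```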
You can download all task data from Feishu Cloud.
After downloading, put all task folders under the `./tasks` directory. The structure should look like this:
```
./tasks/
├── blink_spatial/
├── blink_counting/
├── blink_viscorr/
├── blink_semcorr/
├── blink_functional_correspondence/
├── mme_commonsense_reasoning/
├── mme_count/
├── mme_existence/
├── mme_position/
├── mme_color/
├── ooo/
└── ...
```
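After extracting the data, you can quickly confirm that the expected folders are in place. The sketch below only checks the folders listed above; the actual release may contain additional task folders.

```python
# Minimal sanity check that the downloaded task folders are under ./tasks.
# The list below only covers the folders shown in this README.
from pathlib import Path

expected = [
    "blink_spatial", "blink_counting", "blink_viscorr", "blink_semcorr",
    "blink_functional_correspondence", "mme_commonsense_reasoning",
    "mme_count", "mme_existence", "mme_position", "mme_color", "ooo",
]
missing = [name for name in expected if not (Path("tasks") / name).is_dir()]
print("missing task folders:", missing or "none")
```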
The `run.sh` script in the project root can be used to batch-run tasks. It automatically calls scripts in subdirectories such as `vat`, supporting multiple models and tasks in parallel.

To run:

```bash
bash run.sh
```

The script launches all configured tasks in the background. Logs and outputs are saved in the corresponding subdirectories.
You can run a single task with the following command:

```bash
python run_task.py --task your_task
```

Replace `your_task` with the name of the task you want to run.
- `MODEL`: Specify the model to use.
  - `gpt-4o`: gpt-4o-2024-11-20
  - `gemini`: gemini-2.0-pro-exp-02-05
- `ABS_TYPE`: Specify the type of visual abstract. Options are:
  - `ps`: PhotoSketch
  - `open`: OpenSketch
  - `contour`: Contour style
  - `anime`: Anime style

Example:

```bash
MODEL=gpt-4o ABS_TYPE=open python run_task.py --task ooo
```
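To sweep several abstract types on the same task, these variables can also be set from a small driver script. The sketch below is only an illustration; it relies solely on the `MODEL` and `ABS_TYPE` variables documented above, and the task name `ooo` is just an example.

```python
# Minimal sketch: run one task with every abstract type by setting the
# MODEL / ABS_TYPE environment variables documented above.
import os
import subprocess

for abs_type in ["ps", "open", "contour", "anime"]:
    env = dict(os.environ, MODEL="gpt-4o", ABS_TYPE=abs_type)
    subprocess.run(["python", "run_task.py", "--task", "ooo"], env=env, check=True)
```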
Our VAT framework supports a variety of tasks for visual perception and multimodal reasoning, covering different benchmarks. Below is a list of all available tasks and their corresponding benchmarks:
| Task Name | Benchmark |
|---|---|
| mme_existence | MME |
| mme_count | MME |
| mme_color | MME |
| mme_position | MME |
| mme_commonsense_reasoning | MME |
| blink_counting | BLINK |
| blink_spatial | BLINK |
| blink_viscorr | BLINK |
| blink_semcorr | BLINK |
| blink_functional_correspondence | BLINK |
| vd_illusion | HallusionBench |
| ooo | Odd-One-Out |
| direction | CoSpace |
| angle | CoSpace |
| dif-ang | CoSpace |
| diff | CoSpace |
These tasks cover object-centric perception, object relation reasoning, spatial reasoning, and commonsense reasoning.
Note: For the tasks `blink_viscorr`, `blink_semcorr`, and `blink_functional_correspondence`, you need to set the environment variable `ADD_DOT=1` to mark reference points in the visual abstract. For example:

```bash
ADD_DOT=1 MODEL=gpt-4o ABS_TYPE=open python run_task.py --task blink_viscorr
```
You can use `calc.py` to evaluate the results of your tasks. The basic usage is:

```bash
python calc.py <outputs_folder_path> [--accp 0/1] [--abs-type <abstract_type>] [--model <model_name>]
```
- `<outputs_folder_path>`: Path to the folder containing the output results to be evaluated.
- `--accp 0/1`: Whether to calculate the acc+ metric for MME tasks (set to 1 for acc+, 0 for standard accuracy).
- `--abs-type <abstract_type>`: Specify the type of abstract to evaluate (e.g., `ps`, `open`, `contour`, `anime`).
- `--model <model_name>`: Specify the model name to be evaluated (e.g., `gpt-4o`, `gemini`).
Example:

```bash
python calc.py <outputs_folder_path> --accp 1 --abs-type open --model gpt-4o
```

This will evaluate all results in the `<outputs_folder_path>` folder, calculate acc+ for MME tasks, use the `open` abstract type, and evaluate the `gpt-4o` model.
Note: `calc.py` can only be used to evaluate tasks other than those from the CoSpace benchmark.
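If you want to evaluate several abstract types in one pass, the command above can be wrapped in a small loop. The sketch below is an illustration only; `path/to/outputs` is a placeholder for your actual outputs folder.

```python
# Minimal sketch: evaluate all abstract types with the calc.py CLI shown above.
import subprocess

outputs_dir = "path/to/outputs"  # placeholder: replace with your outputs folder
for abs_type in ["ps", "open", "contour", "anime"]:
    subprocess.run(
        ["python", "calc.py", outputs_dir,
         "--accp", "1", "--abs-type", abs_type, "--model", "gpt-4o"],
        check=True,
    )
```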
If you find our work helpful, please cite us:
```bibtex
@misc{liu2025Visual,
  title         = {Visual {{Abstract Thinking Empowers Multimodal Reasoning}}},
  author        = {Liu, Dairu and Wang, Ziyue and Ruan, Minyuan and Luo, Fuwen and Chen, Chi and Li, Peng and Liu, Yang},
  year          = {2025},
  month         = may,
  number        = {arXiv:2505.20164},
  eprint        = {2505.20164},
  primaryclass  = {cs},
  publisher     = {arXiv},
  doi           = {10.48550/arXiv.2505.20164},
  url           = {http://arxiv.org/abs/2505.20164},
  archiveprefix = {arXiv},
  langid        = {american},
}
```
This project integrates and adapts code from the following open-source projects:

- informative-drawings
- PhotoSketch

For further details, please refer to the README files in each submodule.