We introduce 'Thinking with Video', a new paradigm leveraging video generation for multimodal reasoning. Our VideoThinkBench shows that Sora-2 surpasses GPT5 by 10% on eyeballing puzzles and reaches 69% accuracy on MMMU.

tongjingqi/Thinking-with-Video

[CVPR 2026] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

🎊 News

  • [2026.02] 🔥🔥 Our work has been accepted by CVPR 2026! 🎉🎉🎉
  • [2025.11] Our paper "Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm" has been released on arXiv! 📄 [Paper] On Hugging Face, it was ranked "#1 Paper of the Day" and "#1 Paper of the Month"!
  • [2025.11] 🔥 We release the "minitest" of our VideoThinkBench, with 500 test samples of vision-centric tasks and 250 test samples of text-centric tasks. The "minitest" reduces evaluation cost.
  • [2025.12] 🔥 We release the VideoThinkBench Leaderboard, which covers video generation models, image generation models, and vision-language models.

📜 Brief Introduction

Moving beyond the traditional paradigms of "Thinking with Text" (e.g., Chain-of-Thought) and "Thinking with Images", we propose "Thinking with Video", a new paradigm that unifies visual and textual reasoning through video generation models. It naturally enables human-like dynamic reasoning, such as drawing and imagination.

💡 A New Unified Reasoning Paradigm
"Thinking with Video" leverages video generation models to visualize dynamic processes, represent temporal evolution, and embed text within video frames. This approach achieves unified multimodal understanding and generation, overcoming the static constraints of image-based reasoning and the modality separation in traditional approaches.

📊 VideoThinkBench: A Comprehensive Benchmark
We developed VideoThinkBench, the first reasoning benchmark specifically designed for evaluating video generation models. It comprises vision-centric tasks (eyeballing puzzles, visual puzzles, ARC-AGI-2, mazes) that leverage dynamic visual reasoning, and text-centric tasks adapted from established benchmarks (MATH, GSM8K, MMLU, MMMU, etc.) that test text-based reasoning capabilities within generated videos.

🚀 Surpassing VLMs on Several Tasks
Our evaluation shows that Sora-2 demonstrates competitive reasoning capabilities across both categories. Notably, Sora-2 surpasses state-of-the-art vision-language models on several vision-centric tasks, showcasing the unique advantages of dynamic visual reasoning. On text-centric tasks, Sora-2 achieves strong performance including 98.9% on GSM8K, 94.0% on MATH, and 75.5% on MMMU, demonstrating the potential of "Thinking with Video" as a unified multimodal reasoning paradigm.

🛠️ Installation and Dataset Download

  1. Clone this repository and navigate to the Thinking-with-Video folder

    git clone --recursive https://github.com/tongjingqi/Thinking-with-Video.git
    cd Thinking-with-Video
  2. Install dependencies

    conda create -y -n thinking_with_video python==3.12
    conda activate thinking_with_video
    pip install -r requirements.txt
  3. Download benchmark datasets from Hugging Face

    hf download --repo-type dataset OpenMOSS-Team/VideoThinkBench --local-dir VideoThinkBench
    cd VideoThinkBench
    
    # unzip the zip datasets under the `Vision-Centric_Reasoning` and `Text-Centric_Reasoning` folders
    bash unzip_dir.sh Vision-Centric_Reasoning
    bash unzip_dir.sh Text-Centric_Reasoning
    # [Note] you can choose to use the minitest version for evaluation
    # bash unzip_dir.sh minitest_Vision-Centric_Reasoning
    # bash unzip_dir.sh minitest_Text-Centric_Reasoning
    
    # check the statistics of the datasets
    python check.py Vision-Centric_Reasoning > vision_centric_stats.txt
    python check.py Text-Centric_Reasoning > text_centric_stats.txt
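
If `bash` is not available, the extraction step can be approximated in Python. This is only a sketch: it assumes that `unzip_dir.sh` extracts every `.zip` archive under the given folder next to the archive itself, which we have not verified against the script. The `unzip_dir` function below is a hypothetical helper, not part of the repository.

```python
import zipfile
from pathlib import Path


def unzip_dir(root):
    """Extract every .zip found under `root` into the archive's own folder.

    Assumption: this mirrors what `bash unzip_dir.sh <folder>` does; the
    real script may behave differently (e.g. delete archives afterwards).
    Returns the list of archives that were extracted.
    """
    extracted = []
    for archive in sorted(Path(root).rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted
```

Calling `unzip_dir("Vision-Centric_Reasoning")` from inside the `VideoThinkBench` folder would then stand in for the corresponding `bash unzip_dir.sh` line.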

📚 VideoThinkBench

VideoThinkBench is a comprehensive benchmark for evaluating video generation models' reasoning capabilities, consisting of two main categories:

Vision-Centric Tasks

  • Eyeballing Puzzles: Spatial reasoning tasks requiring visual estimation and drawing
  • Visual Puzzles: Pattern recognition and visual logic problems
  • ARC-AGI-2: Abstract reasoning tasks requiring few-shot learning
  • Mazes: Path-finding and navigation challenges

Text-Centric Tasks

Adapted from established benchmarks including:

  • Math Reasoning: GSM8K, MATH-500, AIME24, AIME25
  • General Knowledge Reasoning: BBH, MMLU, MMLU-Pro, GPQA-diamond, SuperGPQA-easy
  • Multimodal Math Reasoning: MathVista, MathVision
  • Multimodal Understanding: MMBench, MMMU

The dataset (both the "minitest" and the full test version) is available on Hugging Face.
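
For scripting over the benchmark, the task taxonomy listed above can be written down as plain Python data. This is an illustrative sketch only: the variable names and the `group_of` helper are hypothetical, and the folder layout of the downloaded dataset may not use these exact labels.

```python
# Task taxonomy as listed in this README (labels only, not dataset paths).
VISION_CENTRIC_TASKS = ["Eyeballing Puzzles", "Visual Puzzles", "ARC-AGI-2", "Mazes"]

TEXT_CENTRIC_GROUPS = {
    "Math Reasoning": ["GSM8K", "MATH-500", "AIME24", "AIME25"],
    "General Knowledge Reasoning": [
        "BBH", "MMLU", "MMLU-Pro", "GPQA-diamond", "SuperGPQA-easy",
    ],
    "Multimodal Math Reasoning": ["MathVista", "MathVision"],
    "Multimodal Understanding": ["MMBench", "MMMU"],
}


def group_of(benchmark):
    """Return the text-centric group that a source benchmark belongs to."""
    for group, names in TEXT_CENTRIC_GROUPS.items():
        if benchmark in names:
            return group
    raise KeyError(f"unknown text-centric benchmark: {benchmark}")


print(group_of("MMMU"))
```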

💻 Code and Evaluation

Vision-Centric Tasks

Text-Centric Tasks

📈 Benchmark Results

Performance Comparison Across All Tasks

The table below summarizes the accuracy (%) of Sora-2 compared with SOTA vision-language models across the tasks in VideoThinkBench (full test):

| Category | Task | Sora-2 | Gemini 2.5 Pro | GPT5 high | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| Vision-Centric | Eyeballing-Point | 44.7 | 27.8 | 33.6 | 36.2 |
| | Eyeballing-Line | 38.0 | 21.0 | 24.0 | 26.3 |
| | Eyeballing-Shape | 34.5 | 34.5 | 32.5 | 50.5 |
| | Visual-Symmetry | 81.9 | 94.9 | 98.5 | 80.1 |
| | Visual-Gradient | 51.9 | 83.7 | 66.7 | 69.9 |
| | Visual-Compositionality | 57.5 | 67.0 | 85.0 | 82.0 |
| | ARC-AGI-2 | 1.3 | 1.9 | 0.5 | 5.3 |
| | Maze-Square | 40.0 | 0.0 | 0.0 | 0.0 |
| | Maze-Hexagon | 0.0 | 0.0 | 0.0 | 0.0 |
| | Maze-Labyrinth | 0.0 | 0.0 | 0.0 | 0.0 |
| | Average | 35.0 | 33.1 | 34.1 | 35.0 |
| Text-Centric | Text-Only Math | 68.6 | 94.8 | 97.2 | 90.0 |
| | Text-Only General Knowledge | 65.3 | 84.5 | 85.2 | 86.3 |
| | Multimodal Math | 61.2 | 66.7 | 69.6 | 65.6 |
| | Multimodal General Knowledge | 79.1 | 83.0 | 80.6 | 82.3 |
| | Average | 68.6 | 82.3 | 83.2 | 81.1 |
| Overall Average | | 44.6 | 47.1 | 48.1 | 48.2 |

Note: For Sora-2, Eyeballing Puzzles use Major Frame evaluation results and Text-Centric Reasoning tasks use Audio evaluation results.
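
The "Average" rows can be reproduced from the per-task numbers. The sketch below assumes each average is an unweighted mean over task rows and that the overall average is taken over all 14 task rows rather than over the two category averages. The reported figures are consistent with that reading to within rounding, but the README does not state the formula. Variable names are illustrative.

```python
# Per-task accuracies (%) copied from the full-test table.
# Column order: Sora-2, Gemini 2.5 Pro, GPT5 high, Claude Sonnet 4.5.
MODELS = ["Sora-2", "Gemini 2.5 Pro", "GPT5 high", "Claude Sonnet 4.5"]

VISION = {
    "Eyeballing-Point": [44.7, 27.8, 33.6, 36.2],
    "Eyeballing-Line": [38.0, 21.0, 24.0, 26.3],
    "Eyeballing-Shape": [34.5, 34.5, 32.5, 50.5],
    "Visual-Symmetry": [81.9, 94.9, 98.5, 80.1],
    "Visual-Gradient": [51.9, 83.7, 66.7, 69.9],
    "Visual-Compositionality": [57.5, 67.0, 85.0, 82.0],
    "ARC-AGI-2": [1.3, 1.9, 0.5, 5.3],
    "Maze-Square": [40.0, 0.0, 0.0, 0.0],
    "Maze-Hexagon": [0.0, 0.0, 0.0, 0.0],
    "Maze-Labyrinth": [0.0, 0.0, 0.0, 0.0],
}
TEXT = {
    "Text-Only Math": [68.6, 94.8, 97.2, 90.0],
    "Text-Only General Knowledge": [65.3, 84.5, 85.2, 86.3],
    "Multimodal Math": [61.2, 66.7, 69.6, 65.6],
    "Multimodal General Knowledge": [79.1, 83.0, 80.6, 82.3],
}
# "Average" rows as reported in the table.
REPORTED = {
    "vision": [35.0, 33.1, 34.1, 35.0],
    "text": [68.6, 82.3, 83.2, 81.1],
    "overall": [44.6, 47.1, 48.1, 48.2],
}


def column_means(*tables):
    """Unweighted mean per model over every task row in the given tables."""
    rows = [scores for table in tables for scores in table.values()]
    return [sum(col) / len(col) for col in zip(*rows)]


computed = {
    "vision": column_means(VISION),
    "text": column_means(TEXT),
    "overall": column_means(VISION, TEXT),
}

for name, means in computed.items():
    for model, mean, reported in zip(MODELS, means, REPORTED[name]):
        print(f"{name:8s} {model:18s} computed={mean:6.2f} reported={reported}")

# Mean over the three eyeballing rows, Sora-2 (index 0) minus GPT5 high (index 2).
eyeballing = [row for task, row in VISION.items() if task.startswith("Eyeballing")]
gap = sum(r[0] - r[2] for r in eyeballing) / len(eyeballing)
print(f"Eyeballing gap, Sora-2 vs GPT5 high: {gap:.1f} points")
```

Because the inputs are already rounded to one decimal place, the recomputed means can differ from the reported ones in the last digit, so any comparison should allow a tolerance of about 0.1.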

Leaderboard on VideoThinkBench (minitest)

Video Generation Models

| # | Model | Average | Eyeballing Point | Eyeballing Line | Eyeballing Shape | Visual Symmetry | Visual Gradient | Visual Compositionality | ARC-AGI-2 | Maze-Square | Maze-Hexagon | Maze-Labyrinth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Sora 2 | 31.6 | 50 | 35 | 25 | 80 | 35 | 53 | 2.8 | 35.3 | 0.0 | 0.0 |
| 2 | Veo 3.1 | 27.7 | 34 | 24 | 30 | 78 | 40 | 70 | 0.7 | 0.0 | 0.0 | 0.0 |
| 3 | MiniMax Hailuo 2.3 | 26.0 | 37 | 34 | 28 | 73 | 45 | 43 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | doubao-seedance-1-0-pro-250528 | 12.4 | 22 | 24 | 35 | 25 | 10 | 8 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | Wan2.2-TI2V-5B | 7.5 | 18 | 10 | 20 | 8 | 10 | 8 | 0.7 | 0.0 | 0.0 | 0.0 |

Image Generation Models

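| # | Model | Average | Eyeballing Point | Eyeballing Line | Eyeballing Shape | Visual Symmetry | Visual Gradient | Visual Compositionality | ARC-AGI-2 | Maze-Square | Maze-Hexagon | Maze-Labyrinth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Nano Banana 2 | 29.8 | 24 | 30 | 35 | 85 | 50 | 73 | 0.71 | 0.0 | 0.0 | 0.0 |
| 2 | Seedream 4.5 | 24.5 | 26 | 16 | 30 | 75 | 35 | 63 | 0 | 0.0 | 0.0 | 0.0 |
| 3 | GPT image 1.5 | 19.3 | 24 | 15 | 18 | 38 | 50 | 48 | 0 | 0.0 | 0.0 | 0.0 |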

Vision-Language Models

| # | Model | Average | Eyeballing Point | Eyeballing Line | Eyeballing Shape | Visual Symmetry | Visual Gradient | Visual Compositionality | ARC-AGI-2 | Maze-Square | Maze-Hexagon | Maze-Labyrinth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 37.3 | 40 | 34 | 60 | 75 | 75 | 83 | 5.7 | 0.0 | 0.0 | 0.0 |
| 2 | Gemini 2.5 Pro | 35.6 | 33 | 23 | 40 | 95 | 95 | 68 | 2.1 | 0.0 | 0.0 | 0.0 |
| 3 | GPT5 high | 35.5 | 39 | 30 | 23 | 98 | 80 | 85 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Qwen3-VL-235B-A22B | 30.2 | 24 | 17 | 30 | 93 | 55 | 83 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | Qwen3-VL-32B | 29.6 | 33 | 21 | 20 | 85 | 55 | 78 | 4.1 | 0.0 | 0.0 | 0.0 |
| 6 | Qwen3-VL-Plus | 29.4 | 32 | 29 | 30 | 90 | 35 | 78 | 0.0 | 0.0 | 0.0 | 0.0 |

Note:

  • "Eyeballing Point/Line/Shape" refer to Point Tasks, Line Tasks and Shape Tasks in Eyeballing Puzzles. The results of video generation models are Major Frame evaluation results.
  • "Visual Symmetry/Gradient/Compositionality" refer to the Symmetry Tasks, Gradient Tasks and Compositionality Tasks in Visual Puzzles.

💡 Takeaways

Our systematic evaluation on VideoThinkBench reveals seven key findings:

  1. Surpassing VLMs on Eyeballing Puzzles: Sora-2 generally surpasses SOTA VLMs on eyeballing puzzles, exhibiting strong geometric and physical reasoning abilities. It can simulate the extension and reflection of rays and manipulate geometric elements (e.g., points and lines) to support spatial reasoning.

  2. Inductive Reasoning on Visual Puzzles: Sora-2's performance is comparable to Claude Sonnet 4.5 on Shape-Drawing puzzles, demonstrating inductive reasoning capabilities. Sora-2 can recognize and apply patterns of color, shape, and size, solving visual puzzles involving symmetry, gradients, and compositionality.

  3. Few-Shot Learning Capabilities: Sora-2 is a few-shot learner. ARC-AGI-2 requires finding patterns in input-output pairs, and SOTA VLMs reach at most 5.3% accuracy on it (full test). Sora-2 can often make reasonable predictions, although they do not strictly match dataset annotations.

  4. Unified Multimodal Reasoning: On text-centric tasks, Sora-2 shows surprising performance on text and multimodal reasoning benchmarks. The video generation model can embed text within video frames, enabling unified multimodal understanding and generation. This demonstrates that "Thinking with Video" is potentially a unified multimodal reasoning paradigm.

  5. Improved In-Context Learning with More Examples: Sora-2 achieves better in-context learning when given more examples. Experiments show that Sora-2 performs better when provided with all examples than with only one example, revealing an underexplored direction for analyzing and improving the in-context learning abilities of video generation models.

  6. Test-Time Scaling with Self-Consistency: Self-consistency can improve Sora-2's performance on verifiable video generation reasoning tasks. This reveals an underexplored direction: test-time scaling in video generation reasoning tasks.

  7. Analysis of Capability Source: We systematically analyzed the source of Sora-2's capabilities. Sora-2 maintains performance comparable to the original test set on adapted math problems, reducing the likelihood of test set leakage. However, Sora-2 struggles to generate coherent reasoning processes in videos, even when it provides correct final answers. Through comparative experiments with Wan 2.5, we speculate that Sora-2's text-centric reasoning ability originates from its prompt rewriter model.
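
Takeaway 6 refers to self-consistency, which is commonly implemented by sampling several outputs for the same problem and keeping the majority answer. The following is a minimal sketch of that voting step, assuming each generated video has already been reduced to an extracted answer string (or `None` when no answer could be read). The function name and the sample values are hypothetical and are not taken from the repository's evaluation code.

```python
from collections import Counter


def self_consistent_answer(answers):
    """Majority vote over answers extracted from several generations.

    `answers` holds one entry per generated sample; None marks a sample
    with no readable answer. Returns (winning answer, vote share), where
    the share is measured against all samples, including unreadable ones.
    Ties are broken in favour of the answer seen first.
    """
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    return answer, count / len(answers)


# Five hypothetical samples for one multiple-choice puzzle.
samples = ["B", "B", "D", None, "B"]
print(self_consistent_answer(samples))
```

The vote share can double as a rough confidence signal, for example to flag problems where no option wins a clear majority.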

⚖️ Licenses

Code License

This project is licensed under the MIT License - see the LICENSE file for details.

🔎 Citation

If you find our work helpful, please consider citing our paper 📝 and starring us ⭐️!

@article{tong2025thinking,
  title={Thinking with video: Video generation as a promising multimodal reasoning paradigm},
  author={Tong, Jingqi and Mou, Yurong and Li, Hangcheng and Li, Mingzhe and Yang, Yongzhuo and Zhang, Ming and Chen, Qiguang and Liang, Tianyi and Hu, Xiaomeng and Zheng, Yining and others},
  journal={arXiv preprint arXiv:2511.04570},
  year={2025}
}


Made with ❤️ for advancing multimodal reasoning research
