πŸš€ Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More

Zichen Wen1,2, Yifeng Gao1, Shaobo Wang1, Junyuan Zhang2, Qintong Zhang2,4,
Weijia Li3,2, Conghui He2βœ‰, Linfeng Zhang1βœ‰,

1Shanghai Jiao Tong University, 2Shanghai AI Laboratory,
3Sun Yat-sen University, 4Peking University


πŸ”₯ News

  • 2025.10.13 πŸ€—πŸ€— We have released our latest work EPIC, an efficient framework for progressive consistency distillation in multimodal large language models!
  • 2025.10.10 πŸ€—πŸ€— We've released our latest work, VTC-Bench. Come test whether your token compression method really works!
  • 2025.08.30 πŸ€—πŸ€— We have seamlessly integrated DART into Qwen2.5-VL.
  • 2025.08.21 πŸ€—πŸ€— Our DART is accepted at EMNLP'25 main!
  • 2025.05.15 πŸ€—πŸ€— Our analytical work on token compression has been accepted as ACL'25 Finding!
  • 2025.03.19 πŸ€—πŸ€— The implementation and evaluation scripts for LLaVA-Next are now available
  • 2025.03.18 πŸ€—πŸ€— We have released the implementation of DART for Qwen2-VL, and now you can easily evaluate it using lmms-eval!
  • 2025.02.22 πŸ€—πŸ€— We release our latest work DART, a plug-and-play, training-free token reduction method that seamlessly integrates with efficient attention operators. Code is available!

πŸ‘€ Overview


TL;DR: We propose DART (Duplication-Aware Reduction of Tokens), a training-free method that prunes vision tokens based on duplication rather than importance, achieving 88.9% token reduction and a 1.99× speed-up while maintaining performance and compatibility with efficient attention operators.
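
For intuition, the sketch below illustrates one way duplication-based pruning can work: tokens are kept greedily so that each newly retained token has low cosine similarity (i.e., low duplication) with the tokens already kept. This is only an illustrative toy, not the official DART implementation; the function name prune_duplicate_tokens and all variable names are hypothetical.

 import torch
 import torch.nn.functional as F

 def prune_duplicate_tokens(vision_tokens: torch.Tensor, keep: int) -> torch.Tensor:
     """Illustrative duplication-based reduction: vision_tokens is (N, D), returns (keep, D)."""
     feats = F.normalize(vision_tokens, dim=-1)        # unit-norm so dot product = cosine similarity
     kept = [0]                                        # greedily seed with the first token
     for _ in range(keep - 1):
         kept_feats = feats[kept]                      # (K, D) embeddings of retained tokens
         # Duplication score: each candidate's highest similarity to any retained token
         dup = (feats @ kept_feats.T).max(dim=1).values
         dup[kept] = float("inf")                      # never re-select an already-kept token
         kept.append(int(dup.argmin()))                # keep the least duplicated candidate
     return vision_tokens[torch.tensor(kept)]

 # Example: reduce 576 LLaVA-style vision tokens to 128 (~77.8% reduction)
 tokens = torch.randn(576, 1024)
 print(prune_duplicate_tokens(tokens, keep=128).shape)  # torch.Size([128, 1024])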

πŸ›  Preparation

LLaVA

  1. Clone this repository.
 git clone https://github.com/ZichenWen1/DART
 cd DART
  2. Environment Setup and Preparation
 conda create -n DART python=3.10 -y
 conda activate DART
 pip install -e .
 pip install flash-attn --no-build-isolation
  3. Download Multimodal Benchmarks

Please follow the detailed instructions in LLaVA-Evaluation.

Qwen2-VL

 conda create -n DART_Qwen2VL python=3.10 -y
 conda activate DART_Qwen2VL
 cd Qwen2-VL/transformers && pip install -e .
 pip install accelerate qwen-vl-utils[decord]
 pip install flash-attn --no-build-isolation
 cd ../../lmms-eval && pip install -e .

Qwen2.5-VL

pip install -U transformers==4.55.4
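
After installing the pinned transformers version, the snippet below is a quick sanity check that the environment loads Qwen2.5-VL through the standard Hugging Face API. It does not configure anything DART-specific (the reduction settings are passed via the eval scripts under Qwen2_5-VL/), and the checkpoint and image URL are only examples.

 from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
 from qwen_vl_utils import process_vision_info

 model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # example checkpoint
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
     model_id, torch_dtype="auto", device_map="auto"
 )
 processor = AutoProcessor.from_pretrained(model_id)

 messages = [{
     "role": "user",
     "content": [
         {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
         {"type": "text", "text": "Describe this image."},
     ],
 }]
 text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 image_inputs, video_inputs = process_vision_info(messages)
 inputs = processor(
     text=[text], images=image_inputs, videos=video_inputs,
     padding=True, return_tensors="pt",
 ).to(model.device)

 output_ids = model.generate(**inputs, max_new_tokens=64)
 trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
 print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])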

🎯 Usage

LLaVA

πŸ“– Script Templates

bash scripts/v1_5/eval/[Benchmark].sh [Reduction_Ratio] [Max_Num_Truncation]
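
For reference, LLaVA-1.5 encodes each image into 576 vision tokens, so the 0.778 reduction ratio used in the examples below removes roughly 576 × 0.778 ≈ 448 tokens and keeps about 128.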

🐝 Examples

CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh 0.778 128
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh 0.778 128
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh 0.778 128

Qwen2-VL

🐝 Examples

cd Qwen2-VL
bash eval_scripts/lmms_eval.sh True [Reduction_Ratio]

Qwen2.5-VL

🐝 Examples

cd Qwen2_5-VL
bash eval_scripts/lmms_eval.sh True [Reduction_Ratio]

πŸ”‘ License

This project is released under the Apache 2.0 license.

πŸ“Œ Citation

If our findings help your research, please consider citing our papers in your publications.

@article{wen2025stop,
  title={Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More},
  author={Wen, Zichen and Gao, Yifeng and Wang, Shaobo and Zhang, Junyuan and Zhang, Qintong and Li, Weijia and He, Conghui and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2502.11494},
  year={2025}
}

@article{wen2025token,
  title={Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?},
  author={Wen, Zichen and Gao, Yifeng and Li, Weijia and He, Conghui and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2502.11501},
  year={2025}
}

πŸ‘ Acknowledgment

We extend our gratitude to the open-source efforts of LLaVA, Qwen2-VL, and lmms-eval.

πŸ“© Contact

For any questions about our paper or code, please email [email protected].
