# VIBE: Video-to-Text Information Bottleneck Evaluation for TL;DR

This repository contains the source code for VIBE.
VIBE is an annotation-free method that selects video summaries by scoring task relevance and visual grounding without retraining. Human studies show VIBE improves accuracy and reduces response time over naive VLM summaries and full videos across three datasets.
- System Plot and Major Results
- Project Structure
- Prerequisites
- Key Utility Functions
- Model Configuration
- Citation
## System Plot and Major Results

Our results show that video captions selected by VIBE achieve a better trade-off between human response time and task accuracy than naive VLM captions or full videos.
## Project Structure

```
.
├── config/                     # Configuration files for different tasks
│   ├── AIVideoConf.yaml        # AI Conference Video analysis config
│   ├── LongVideoBench.yaml     # Long video benchmark config
│   └── TrafficQA.yaml          # Traffic QA config
│
└── src/                        # Source code
    ├── AIConfVideo_script/     # AI Conference Video analysis
    ├── LongVideoBench_script/  # Long video benchmark
    ├── TrafficQA_script/       # Traffic QA
    ├── client/                 # Client-side code
    ├── server/                 # Server-side code
    └── utils/                  # Utility functions
```
## Prerequisites

See `requirements.txt` for the required packages and their versions; install them with `pip install -r requirements.txt`.
## Key Utility Functions

The `src/utils/` directory contains several important utility modules:
### `tf_idf.py`

TF-IDF based keyword extraction from a text corpus:

- Extracts important keywords using Term Frequency-Inverse Document Frequency
- Supports customizable parameters for document frequency thresholds
- Handles n-grams and stop words
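
Below is a minimal sketch of this kind of extraction using scikit-learn's `TfidfVectorizer`; the function name and parameter defaults are illustrative assumptions, not the module's actual API.

```python
# Illustrative sketch of TF-IDF keyword extraction; not the actual API of tf_idf.py.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(corpus, top_k=10, min_df=2, max_df=0.8):
    """Return the top_k highest-scoring terms across a corpus of documents."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),    # unigrams and bigrams
        stop_words="english",  # drop common stop words
        min_df=min_df,         # ignore terms in fewer than min_df documents
        max_df=max_df,         # ignore terms in more than max_df of documents
    )
    tfidf = vectorizer.fit_transform(corpus)
    # Sum each term's TF-IDF score over all documents and rank the terms.
    scores = tfidf.sum(axis=0).A1
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)
    return [term for term, _ in ranked[:top_k]]

print(extract_keywords(["video summarization with captions",
                        "caption selection for video QA"]))
```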
### `easyocr_utils.py`

Optical Character Recognition (OCR) utilities:

- Text detection and recognition in images
- Keyword-based text masking
- Support for multiple languages
- Confidence threshold filtering
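
A hedged sketch of how such utilities typically look with the `easyocr` package; the helper names and the masking strategy below are illustrative assumptions, not the module's implementation.

```python
# Illustrative sketch using easyocr; helper names are assumptions, not easyocr_utils.py's API.
import cv2
import easyocr
import numpy as np

reader = easyocr.Reader(["en"])  # accepts multiple language codes, e.g. ["en", "ch_sim"]

def detect_text(image_path, min_confidence=0.5):
    """Run OCR and keep detections above a confidence threshold."""
    results = reader.readtext(image_path)  # list of (bbox, text, confidence)
    return [(bbox, text) for bbox, text, conf in results if conf >= min_confidence]

def mask_keywords(image_path, keywords, min_confidence=0.5):
    """Black out detected text regions whose content matches any keyword."""
    image = cv2.imread(image_path)
    for bbox, text in detect_text(image_path, min_confidence):
        if any(k.lower() in text.lower() for k in keywords):
            pts = np.array(bbox, dtype=np.int32)  # bbox is four corner points
            cv2.fillPoly(image, [pts], color=(0, 0, 0))
    return image
```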
### `scene_detect.py`

Scene detection utilities:

- Video scene boundary detection
- Keyframe extraction
- Scene transition analysis
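
A minimal sketch of boundary detection plus keyframe grabbing, assuming PySceneDetect (`scenedetect` >= 0.6) and OpenCV; taking the first frame of each scene as its keyframe is an illustrative choice, not necessarily what the module does.

```python
# Illustrative sketch using PySceneDetect; not scene_detect.py's actual API.
import cv2
from scenedetect import detect, ContentDetector

def scene_keyframes(video_path, threshold=27.0):
    """Detect scene boundaries and grab the first frame of each scene."""
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    cap = cv2.VideoCapture(video_path)
    keyframes = []
    for start, _ in scenes:  # each scene is a (start, end) pair of FrameTimecodes
        cap.set(cv2.CAP_PROP_POS_FRAMES, start.get_frames())
        ok, frame = cap.read()
        if ok:
            keyframes.append((start.get_timecode(), frame))
    cap.release()
    return keyframes
```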
### `fill_in_mask.py`

Image inpainting utilities:

- Mask filling and image completion
- Region-based image editing
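
A minimal inpainting sketch using OpenCV's classical `cv2.inpaint`; the actual module may rely on a different (e.g. learned) inpainting method.

```python
# Illustrative sketch of mask-based inpainting; fill_in_mask.py may differ.
import cv2

def fill_masked_region(image_path, mask_path, radius=3):
    """Fill nonzero regions of a binary mask by inpainting from surrounding pixels."""
    image = cv2.imread(image_path)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)  # 8-bit, nonzero = inpaint here
    return cv2.inpaint(image, mask, radius, cv2.INPAINT_TELEA)
```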
### `nltk_utils.py`

NLP utilities:

- Text preprocessing
- Tokenization and lemmatization
- Language model integration
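
A minimal preprocessing sketch with NLTK; the pipeline below (lowercase, tokenize, drop stop words, lemmatize) is an illustrative assumption about what the module's preprocessing covers.

```python
# Illustrative NLTK preprocessing sketch; not nltk_utils.py's actual API.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet", "omw-1.4"):
    nltk.download(resource, quiet=True)

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and lemmatize a piece of text."""
    stop_words = set(nltk.corpus.stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

print(preprocess("The cars were detected crossing the intersections."))
```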
### `primary_area.py`

Area detection and analysis:

- Primary region detection
- Area-based feature extraction
- Spatial analysis utilities
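
These bullet points are generic, so the sketch below is only a loose illustration: it treats "primary region detection" as largest-contour extraction with OpenCV, which is purely an assumption about the intended behavior.

```python
# Loose illustration of primary-region detection; an assumption, not primary_area.py's logic.
import cv2

def primary_region(image, threshold=127):
    """Return the bounding box (x, y, w, h) of the largest bright region, or None."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)  # primary region = largest contour
    return cv2.boundingRect(largest)
```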
## Model Configuration

The project supports multiple vision-language models for video understanding:
- InternVL-2.5-8B-MPO
- InternVL3-38B
- Qwen2.5-VL-72B-Instruct-AWQ
You may modify the config files in `config/` to run any other model supported by vLLM.
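
As a hedged illustration of querying one of these models once it is served (e.g. via vLLM's `vllm serve` command), the sketch below calls a vLLM OpenAI-compatible endpoint; the port, prompt, and client setup are assumptions, not taken from this repo's `client/` code.

```python
# Illustrative client call against a vLLM OpenAI-compatible server; endpoint and
# prompt are assumptions, not this repository's client implementation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    messages=[{"role": "user",
               "content": "Summarize the key events in this video caption: ..."}],
)
print(response.choices[0].message.content)
```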
## Citation

```bibtex
@misc{chen2025vibevideototextinformationbottleneck,
      title={VIBE: Video-to-Text Information Bottleneck Evaluation for TL;DR},
      author={Shenghui Chen and Po-han Li and Sandeep Chinchali and Ufuk Topcu},
      year={2025},
      eprint={2505.17423},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.17423},
}
```