
Visualization of evaluation metrics across different models (RadarChartPlot.py)
VCapsBench is a comprehensive benchmark for evaluating the quality of video captions generated by vision-language models. This repository provides:
- 🎥 A large-scale dataset with diverse video content
- ⚖️ Fine-grained evaluation results for multiple models
- 📊 Visualization tools for analyzing caption quality
- 🤖 Scripts for generating and evaluating captions
Access the VCapsBench dataset on Hugging Face:
| File | Description | Link |
|---|---|---|
| VCapsbench_Caption_ALL.csv.zip | Raw caption dataset | Download |
| gemini_eval_results.zip | Evaluation results (Gemini-2.5-Pro) | Download |
| gpt_eval_results.zip | Evaluation results (GPT-4.1) | Download |
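
After downloading the caption archive, it can help to confirm which caption columns it actually contains before launching an evaluation. The following is a minimal sketch, not part of this repository: the helper names `read_csv_header` and `missing_columns` are hypothetical, and it assumes the archive holds a single UTF-8 CSV file with a header row.

```python
import csv
import io
import zipfile


def read_csv_header(zip_source):
    """Return the column names of the first CSV file found in a zip archive.

    `zip_source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in archive")
        with archive.open(csv_names[0]) as raw:
            # Assumption: the file is UTF-8; utf-8-sig also strips a BOM if present.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return next(csv.reader(text))


def missing_columns(header, wanted):
    """Return the entries of `wanted` that are absent from `header`, in order."""
    present = set(header)
    return [name for name in wanted if name not in present]
```

For example, `missing_columns(read_csv_header("VCapsbench_Caption_ALL.csv.zip"), ["gpt4o_cap", "Qwen2.5-VL-72B"])` would return an empty list if both caption columns are present in the downloaded file (the local path here is an assumption).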
Supported models:
- Qwen2.5-VL-72B
- Qwen2.5-VL-7B
- Qwen2VL-7B
- InternVL2.5-8B
- NVILA-8B
- LLaVA-Video-7B
- VideoLLaMA3-7B
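
The `eval.sh` script that follows runs `LLM4eval-m.py` once per caption column. For readers who would rather drive that loop from Python, here is a rough equivalent. It is a sketch under assumptions, not code from this repository: `build_eval_commands` and `run_all` are hypothetical helpers, and the only flags used are the ones that appear in `eval.sh`.

```python
import subprocess


def build_eval_commands(caption_cols, input_file, dataset_path, output_dir,
                        llm="gemini", max_workers=128, script="LLM4eval-m.py"):
    """Build one argv list per caption column, mirroring the eval.sh loop."""
    commands = []
    for col in caption_cols:
        commands.append([
            "python3", script,
            "--input_file", input_file,
            "--dataset_path", dataset_path,
            "--output_dir", output_dir,
            "--caption_col", col,
            "--llm", llm,
            "--max_workers", str(max_workers),
        ])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the batch on the first failing evaluation.
            subprocess.run(argv, check=True)
```

Keeping `dry_run=True` as the default makes it possible to inspect the generated commands before spending API calls on a full evaluation.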
```bash
#!/bin/bash
# eval.sh - Batch evaluate multiple caption outputs
unset http_proxy
unset https_proxy

# Configuration
input_file="VCapsBench_Caption_ALL.csv"
dataset_path="VCapsBench_100KQA.jsonl"
max_workers=128
llm="gemini"  # "gemini" or "gpt4o"
output_dir="eval_results-gemini-2.5"
caption_cols=(
  "gpt4o_cap"
  "Qwen2.5-VL-72B"
  "gemini2.5_pro-05-06"
  "gemini2.5_pre_flash"
)

# Run evaluations
for caption_col in "${caption_cols[@]}"; do
  python3 LLM4eval-m.py \
    --input_file "$input_file" \
    --dataset_path "$dataset_path" \
    --output_dir "$output_dir" \
    --caption_col "$caption_col" \
    --llm "$llm" \
    --max_workers "$max_workers"
done
```

| Script | Description | Output |
|---|---|---|
| RadarChartPlot.py | Compare model performance across metrics | ![]() |
| WordLength.py | Analyze caption length distribution | ![]() |
| wordlength_IR_CR_plot.py | Relationship between length and quality | ![]() |
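
To illustrate the kind of statistic a caption-length analysis works with, the sketch below summarizes word counts for a list of captions. It is not the repository's `WordLength.py`. The function names are hypothetical, and whitespace tokenization is an assumption that may differ from how the actual script counts words.

```python
import statistics
from collections import Counter


def word_count(caption):
    """Count whitespace-separated tokens; empty or missing captions count as 0."""
    return len(caption.split()) if caption else 0


def length_summary(captions):
    """Return mean, median, and max word count for a list of captions."""
    counts = [word_count(c) for c in captions]
    if not counts:
        raise ValueError("no captions given")
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "max": max(counts),
    }


def length_histogram(captions, bin_width=50):
    """Bucket captions by word count; keys are the lower edge of each bin."""
    bins = Counter((word_count(c) // bin_width) * bin_width for c in captions)
    return dict(sorted(bins.items()))
```

The histogram dictionary can be passed to any plotting library to draw a per-model length distribution.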
```bibtex
@article{zhang2025vcapsbench,
  title={VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation},
  author={Zhang, Shi-Xue and Wang, Hongfa and Huang, Duojun and Li, Xin and Zhu, Xiaobin and Yin, Xu-Cheng},
  journal={arXiv preprint arXiv:2505.23484},
  year={2025}
}
```

