[Preprint 2025] Official code of the paper “TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility”
Keywords: Video-Language Models, Physics Plausibility, Video Reasoning, Trajectory-aware Attention, Benchmarking
Saman Motamed¹,²✉, Minghao Chen², Luc Van Gool¹, Iro Laina²
¹ INSAIT, Sofia University "St. Kliment Ohridski" · ² Visual Geometry Group, University of Oxford
Please support our work by leaving a star on our repo! ⭐⭐⭐
- [202509/10] 📢 📢 TRAVL tuning dataset is released.
- [202509/10] 📢 📢 ImplausiBench benchmark is released.
- [202509/10] 📢 📢 The paper is on arXiv
- Training code (LLaVA-NeXT + TRAVL)
- LLM Judge evaluation script
- Release LLaVA-NeXT + TRAVL weights
- Modern VLMs can summarize a video quite well, yet they fail to reason about fine-grained physical interactions in it.
- TRAVL is a lightweight, modular attention recipe (spatial plus trajectory-aware temporal attention) that helps VLMs judge physics implausibility more reliably.
- ImplausiBench is our 300-video benchmark (150 real, 150 implausible) with paired, style-matched videos and grounded MCQs to evaluate visual-temporal reasoning beyond language shortcuts.
- TRAVL Dataset is our curated dataset of 3,482 videos with 19,708 physics‑focused Q/A pairs.
- Paper: https://arxiv.org/abs/2510.07550
- Project page: https://sam-motamed.github.io/projects/TRAVL
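The spatial plus trajectory-aware temporal attention mentioned above can be pictured as two boolean masks over a video's patch tokens. The sketch below is only an illustration under stated assumptions, not the released TRAVL code: it assumes each patch token carries an integer trajectory id (the `track_ids` grid and both function names are hypothetical), lets spatial attention connect tokens within the same frame, and lets temporal attention connect tokens that share a trajectory id.

```python
import math

def build_masks(track_ids):
    """Build (spatial, temporal) boolean masks for flattened video tokens.

    track_ids is a hypothetical T x P grid: track_ids[t][p] is the trajectory
    id assigned to patch p of frame t. Tokens are flattened frame by frame.
    """
    tokens = [(t, tid) for t, row in enumerate(track_ids) for tid in row]
    # Spatial: a token may attend to every token in its own frame.
    spatial = [[ti == tj for tj, _ in tokens] for ti, _ in tokens]
    # Temporal (assumed): a token may attend to tokens on the same trajectory.
    temporal = [[a == b for _, b in tokens] for _, a in tokens]
    return spatial, temporal

def masked_softmax(scores, mask):
    """Softmax over one row of attention scores, ignoring masked-out entries."""
    kept = [s for s, m in zip(scores, mask) if m]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Two frames, two patches each; the two tracked points swap patch positions.
spatial, temporal = build_masks([[0, 1], [1, 0]])
# With uniform scores, token 0 splits its temporal attention between itself
# and the token in frame 1 that lies on the same trajectory.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], temporal[0])
```

Restricting temporal attention to a trajectory keeps the mask sparse, so a token follows one moving region across frames instead of attending to every patch in every frame.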
A 300-video benchmark (150 real, 150 implausible) for evaluating visual-temporal physics plausibility with paired clips (shared first frame & style) and grounded MCQs that reduce language-only shortcuts.
- Hugging Face → https://huggingface.co/datasets/INSAIT-Institute/ImplausiBench
- What’s inside
  - ImplausiBench/real/*.mp4 & ImplausiBench/implausible/*.mp4
  - ImplausiBench-MCQA.json: grounded multiple-choice questions per pair
- Metrics reported: Human & LLM-judge accuracy on Real / Implausible subsets (150 each)
git lfs install
git clone https://huggingface.co/datasets/INSAIT-Institute/ImplausiBench data/implausibench
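After cloning, a quick sanity check can confirm that the download matches the description above. This is a minimal sketch: `summarize_benchmark` is a hypothetical helper, and the paths passed in the example (where `real/`, `implausible/`, and the MCQ file sit inside the clone) are assumptions that may need adjusting to the actual repository layout.

```python
from pathlib import Path

def summarize_benchmark(video_dir, mcqa_path):
    """Count .mp4 clips per subset and check that the MCQ file is present."""
    video_dir = Path(video_dir)
    counts = {
        name: len(list((video_dir / name).glob("*.mp4")))
        for name in ("real", "implausible")
    }
    counts["has_mcqa"] = Path(mcqa_path).is_file()
    return counts

if __name__ == "__main__":
    # Assumed locations inside the clone; adjust if the layout differs.
    root = Path("data/implausibench")
    print(summarize_benchmark(root / "ImplausiBench", root / "ImplausiBench-MCQA.json"))
```

Per the benchmark description, a complete download should report 150 clips in each subset.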
- Scale: 3,482 videos • 19,708 QA pairs
- Composition: real + implausible clips (e.g., Physics-IQ, Impossible Videos, Video-ChatGPT)
- Link: https://huggingface.co/datasets/INSAIT-Institute/TRAVL
# Option A: huggingface_hub
pip install -U huggingface_hub
python - << 'PY'
from huggingface_hub import snapshot_download
snapshot_download(repo_id="INSAIT-Institute/TRAVL", repo_type="dataset", local_dir="data/travl")
PY
# Option B: git-lfs
git lfs install
git clone https://huggingface.co/datasets/INSAIT-Institute/TRAVL data/travl
Accuracies (%) on Implausible (generated) and Real subsets (150 videos each).
We report both Human and LLM-judge scores. Sorted by Implausible (Human), best → worst.
Model | Implausible (Human) | Implausible (LLM) | Real (Human) | Real (LLM) |
---|---|---|---|---|
LLaVA-NeXT (TRAVL) | 52.7 | 28.7 | 47.3 | 31.3 |
Gemini 2.5 Pro | 41.3 | 29.3 | 100.0 | 78.0 |
LLaVA-NeXT (SFT) | 34.0 | 22.0 | 45.3 | 23.3 |
GPT-4o | 32.7 | 21.3 | 84.7 | 64.0 |
Qwen2.5-VL | 18.7 | 12.0 | 96.7 | 74.7 |
InternVL 2.5 | 12.7 | 4.7 | 92.7 | 76.0 |
LLaVA-NeXT (pretrained) | 3.3 | 2.7 | 98.7 | 62.7 |
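Each score above is a per-subset accuracy over 150 videos. As a rough sketch of that arithmetic, the hypothetical helper below turns per-video verdicts into percentages. The record format is an assumption, and the counts in the example (79 and 71 correct) are chosen only because they are consistent with the 52.7 and 47.3 human scores in the first row; the per-video judgments themselves are not given here.

```python
from collections import defaultdict

def accuracy_by_subset(records):
    """Return accuracy (%) per subset from (subset, is_correct) records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subset, is_correct in records:
        totals[subset] += 1
        hits[subset] += bool(is_correct)
    # Round to one decimal place, matching the precision used in the table.
    return {s: round(100.0 * hits[s] / totals[s], 1) for s in totals}

# Illustrative verdicts only: 79/150 correct on Implausible, 71/150 on Real.
records = (
    [("implausible", True)] * 79 + [("implausible", False)] * 71
    + [("real", True)] * 71 + [("real", False)] * 79
)
scores = accuracy_by_subset(records)
```

With 150 videos per subset, each additional correct answer moves a score by about 0.67 points, which is why the table values fall on a coarse grid.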
@article{motamed2025travl,
title={TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility},
author={Saman Motamed and Minghao Chen and Luc Van Gool and Iro Laina},
year={2025},
eprint={2510.07550},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Questions or feedback? Reach us at [email protected].
Our work was made possible by the efforts behind the following works. Thanks to all the contributors!