A curated paper list on hallucination in Video Large Language Models (Vid-LLMs), covering 29 benchmarks and 42 mitigation methods. Updated monthly via arXiv search.
📄 Survey Paper: Distorted or Fabricated? A Survey on Hallucination in Video LLMs
🔎 Interactive Browser: Search and filter papers by type, mechanism, venue, year, and resources.
- Taxonomy of Video Hallucinations
- Evaluation Benchmarks — 29 benchmarks
- Mitigation Strategies — 42 methods
- Citation
- Contributing
- [2026/05] Classified recent papers from new_papers.md, expanding the list to 29 benchmarks and 42 mitigation methods.
- [2026/04] Our survey has been accepted to ACL 2026 Findings. 👉 arXiv:2604.12944
- [2026/03] Monthly arXiv search is live. Newly found, unclassified papers are listed in new_papers.md.
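For readers curious how a monthly arXiv search could be scripted, here is a minimal sketch against the public arXiv API. It is not the repository's actual search script: the search terms, the helper names `build_query_url` and `parse_feed`, and the returned fields are all assumptions chosen for illustration.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv feed


def build_query_url(terms, max_results=50):
    """Build an arXiv API URL that ANDs the given phrases over all fields."""
    # Hypothetical search terms; the repository's real query is not documented here.
    query = " AND ".join(f'all:"{t}"' for t in terms)
    params = {
        "search_query": query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml):
    """Extract id, title, and YYYY-MM submission month from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        papers.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Collapse the line breaks arXiv inserts into long titles.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "published": entry.findtext(f"{ATOM}published", default="")[:7],
        })
    return papers
```

To run it against the live service, fetch `build_query_url(["video", "hallucination"])` with `urllib.request.urlopen(...).read()` and pass the response body to `parse_feed`. Results would still need manual classification before moving from new_papers.md into the tables.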
Mechanism-driven taxonomy of Vid-LLM hallucinations. Solid fill = benchmarks; striped fill = mitigation methods.
Note
Benchmarks follow the taxonomy above. Each entry includes venue, date, and available resources.
Legend: icons in the Resources column link to the Project Page, GitHub Repository, or Dataset; - = Not Available
Event Misordering (5 papers)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding | VidHalluc | CVPR 2025 | 12/2024 | |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | HAVEN | arXiv 2025 | 03/2025 | |
| MHBench: Demystifying Motion Hallucination in VideoLLMs | MHBench | AAAI 2025 | 01/2025 | |
| KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding | KPM-Bench | arXiv 2026 | 02/2026 | - |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | ARGUS | ICCV 2025 | 06/2025 | |
Duration Distortion (2 papers)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | VideoHallucer | arXiv 2024 | 06/2024 | |
| Online Video Understanding: OVBench and VideoChat-Online | OVBench | CVPR 2025 | 01/2025 | |
Frequency Confusion (2 papers)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| VidHal: Benchmarking Temporal Hallucinations in Vision LLMs | VidHal | arXiv 2024 | 11/2024 | |
| Vript: A Video Is Worth Thousands of Words | Vript | NeurIPS 2024 | 06/2024 | |
Character Conflation (2 papers)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding | EGOILLUSION | EMNLP 2025 | 11/2025 | |
| MESH: Measuring Hallucinations in Large Video Models | MESH | ACM MM 2025 | 09/2025 | |
Scene Conflation (1 paper)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding | ELV-Halluc | arXiv 2025 | 08/2025 | |
Object-Action Hallucination (2 papers)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding | VideoHallu | NeurIPS 2025 | 05/2025 | |
| Models See Hallucinations: Evaluating the Factuality in Video Captioning | FactVC | EMNLP 2023 | 03/2023 | |
Scene-Event Hallucination (4 papers)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| EventHallusion: Diagnosing Event Hallucinations in Video LLMs | EventHallusion | arXiv 2024 | 09/2024 | |
| NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models | NOAH | arXiv 2025 | 11/2025 | |
| RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives | RoadSocial | CVPR 2025 | 02/2025 | |
| CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs | CCTVBench | arXiv 2026 | 04/2026 | - |
Compositional and Factuality Hallucination (6 papers)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs | INFACT | arXiv 2026 | 03/2026 | - |
| Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models | OmniVCHall | arXiv 2026 | 01/2026 | |
| VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations | VideoHEDGE | arXiv 2026 | 01/2026 | |
| DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding | DualFact | ACL 2026 Findings | 04/2026 | - |
| Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models | GasVideo-1000 | arXiv 2026 | 04/2026 | |
| When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models | VisualTextTrap | arXiv 2026 | 04/2026 | - |
Action Attribution (4 papers)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models | AVHBench | ICLR 2025 | 10/2024 | |
| The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio | CMM | arXiv 2024 | 10/2024 | |
| Exploring Audio Hallucination in Egocentric Video Understanding | Audio Hallucination QA | ICASSP 2026 | 04/2026 | - |
| CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models | AVHalluBench | arXiv 2024 | 05/2024 | |
Emotion Inference (1 paper)
| Title | Benchmark | Venue | Date | Resources |
|---|---|---|---|---|
| EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models | EmotionHallucer | arXiv 2025 | 05/2025 | |
Note
Methods are grouped by target hallucination type. The Training-Free column marks whether a method works without extra training (✔︎) or requires it (✘).
Event Misordering (5 papers)
| Title | Method | Venue | Date | Training-Free | Resources |
|---|---|---|---|---|---|
| SEASON: Mitigating Temporal Hallucination in Video LLMs via Self-Diagnostic Contrastive Decoding | SEASON | arXiv 2025 | 12/2025 | ✔︎ | - |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Video-thinking (TDPO) | arXiv 2025 | 03/2025 | ✘ | |
| SmartSight: Mitigating Hallucination in Video-LLMs via Temporal Attention Collapse | SmartSight | AAAI 2026 | 12/2025 | ✔︎ | - |
| VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos | VideoTemp-o3 | arXiv 2026 | 02/2026 | ✘ | |
| CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models | MixDPO | arXiv 2026 | 01/2026 | ✘ | - |
Duration Distortion (8 papers)
Frequency Confusion (3 papers)
| Title | Method | Venue | Date | Training-Free | Resources |
|---|---|---|---|---|---|
| VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | VTG-LLM | AAAI 2025 | 05/2024 | ✘ | |
| Vript: A Video Is Worth Thousands of Words | Vriptor | NeurIPS 2024 | 06/2024 | ✘ | |
| KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding | MoPE | arXiv 2026 | 02/2026 | ✘ | - |
Character Conflation (2 papers)
| Title | Method | Venue | Date | Training-Free | Resources |
|---|---|---|---|---|---|
| Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens | Vista-LLaMA | CVPR 2024 | 12/2023 | ✘ | |
| Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding | VideoPLR | arXiv 2025 | 11/2025 | ✘ | |
Scene Conflation (2 papers)
| Title | Method | Venue | Date | Training-Free | Resources |
|---|---|---|---|---|---|
| ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding | ELV-Halluc-DPO | arXiv 2025 | 08/2025 | ✘ | |
| Online Video Understanding: OVBench and VideoChat-Online | VideoChat-Online | CVPR 2025 | 01/2025 | ✘ | |
Object-Action Hallucination (2 papers)
| Title | Method | Venue | Date | Training-Free | Resources |
|---|---|---|---|---|---|
| Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment | SANTA | WACV 2026 | 12/2025 | ✘ | |
| EventHallusion: Diagnosing Event Hallucinations in Video LLMs | TCD | arXiv 2024 | 09/2024 | ✔︎ | |
Scene-Event Hallucination (9 papers)
Both Object-Action & Scene-Event (7 papers)
| Title | Method | Venue | Date | Training-Free | Resources |
|---|---|---|---|---|---|
| VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models | VistaDPO | ICML 2025 | 04/2025 | ✘ | |
| VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding | VideoHallu-GRPO | NeurIPS 2025 | 05/2025 | ✘ | |
| Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models | TriCD | arXiv 2026 | 01/2026 | ✘ | |
| Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs | SToP | arXiv 2026 | 04/2026 | ✔︎ | - |
| When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models | VTHM-MoE | arXiv 2026 | 04/2026 | ✘ | - |
| STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models | STEAR | arXiv 2026 | 04/2026 | ✔︎ | - |
| Reinforcing Consistency in Video MLLMs with Structured Rewards | Structured Rewards | arXiv 2026 | 04/2026 | ✘ | - |
Action Attribution (3 papers)
| Title | Method | Venue | Date | Training-Free | Resources |
|---|---|---|---|---|---|
| AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models | AVHModel-Align-FT | ICLR 2025 | 10/2024 | ✘ | |
| AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding | AVCD | NeurIPS 2025 | 05/2025 | ✔︎ | |
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | mrDPO | arXiv 2024 | 10/2024 | ✘ | |
Emotion Inference (1 paper)
| Title | Method | Venue | Date | Training-Free | Resources |
|---|---|---|---|---|---|
| EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models | PEP-MEK | arXiv 2025 | 05/2025 | ✔︎ | |
If this repository or survey helps your work, please cite:
@article{huang2026distorted,
  title={Distorted or Fabricated? A Survey on Hallucination in Video LLMs},
  author={Huang, Yiyang and Zhang, Yitian and Wang, Yizhou and Zhang, Mingyuan and Shi, Liang and Zeng, Huimin and Fu, Yun},
  journal={arXiv preprint arXiv:2604.12944},
  year={2026}
}
Tip
Contributions are welcome:
🔀 Pull Request — Add new papers, update resource links, or correct errors
🐛 Open an Issue — Report mistakes, suggest missing papers, or request features
Resource gaps tracked in data/papers.json:
- Add official code links for 33 entries. Browse: missing code
- Add official project pages for 58 entries. Browse: missing project pages
- Add official dataset or leaderboard links when available.
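If you want to see which entries still lack a given resource before opening a PR, a small script along these lines may help. The layout of data/papers.json is not described in this README, so the sketch assumes a JSON array of objects with hypothetical keys such as `code` and `project_page`; adjust the field names to match the real file.

```python
import json


def load_entries(path="data/papers.json"):
    # Assumes the file holds a JSON array of objects; the real layout may differ.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_missing(entries, field):
    """Count entries whose `field` is absent, empty, or the '-' placeholder."""
    return sum(1 for e in entries if e.get(field) in (None, "", "-"))


def titles_missing(entries, field):
    """List titles of entries missing `field` (the 'title' key is also an assumption)."""
    return [e.get("title", "?") for e in entries if e.get(field) in (None, "", "-")]
```

For example, `count_missing(load_entries(), "code")` would report how many entries have no code link under these assumptions.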
📝 PR Format Guide
Use this structure for new entries:
| [**Paper Title**](paper_link) | Method/Benchmark Name | Venue | MM/YYYY | Resources |
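The row format above can be generated and sanity-checked with a short helper. This is an illustrative sketch, not a tool shipped with the repository: `format_entry` and its parameters are hypothetical, and the optional `training_free` flag is an assumption added because the mitigation tables carry a Training-Free column that the template line does not show.

```python
def format_entry(title, link, name, venue, date, resources="-", training_free=None):
    """Render one markdown table row in the PR format; `date` must be MM/YYYY."""
    month, _, year = date.partition("/")
    valid = (
        len(month) == 2 and len(year) == 4
        and month.isdigit() and year.isdigit()
        and 1 <= int(month) <= 12
    )
    if not valid:
        raise ValueError(f"date must be MM/YYYY, got {date!r}")

    cells = [f"[**{title}**]({link})", name, venue, date]
    if training_free is not None:
        # Mitigation rows: insert the Training-Free mark before Resources.
        cells.append("✔︎" if training_free else "✘")
    cells.append(resources)
    return "| " + " | ".join(cells) + " |"
```

Calling `format_entry("Paper Title", "paper_link", "Name", "arXiv 2026", "04/2026")` yields a five-column benchmark row; passing `training_free=True` or `False` yields the six-column layout used in the mitigation tables.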
If this repository helps, please consider giving it a ⭐
Maintained by the SmileLab team at Northeastern University.
