Mingfei Chen*,
Zijun Cui*,
Xiulong Liu*,
Jinlin Xiang,
Caleb Zheng,
Jingyuan Li,
Eli Shlizerman
University of Washington
* Equal contribution
3D spatial reasoning in dynamic audio-visual environments remains largely unexplored by current Audio-Visual LLMs (AV-LLMs). SAVVY introduces a training-free reasoning pipeline that enhances AV-LLMs by recovering object trajectories and constructing unified global 3D maps for spatial question answering. Alongside the algorithm, we present SAVVY-Bench, the first benchmark for evaluating dynamic 3D spatial reasoning in audio-visual environments.
⭐ If this project helps your research, a star is appreciated!
This project consists of three main components, each hosted in a separate repository:
| Component | Folder | Description |
|---|---|---|
| SAVVY Algorithm | SAVVY | Training-free spatial reasoning pipeline; data preprocessing tools, benchmark data, annotations |
| SAVVY-Bench Dataset | HuggingFace | Preview QA pairs in the Hugging Face Data Studio |
| Evaluation Code | SAVVY-Bench | Multi-model benchmarking framework |
Main Algorithm Repository - Core SAVVY pipeline implementation
The SAVVY algorithm employs a multi-stage reasoning approach:
- Snapshot Descriptor (SD): Prompts AV-LLMs to extract structured descriptions and initial object trajectories
- Text-Guided Segmentation (Seg): Refines trajectories using visual foundation models (CLIPSeg, SAM2, ZoeDepth)
- Spatial Audio & Track Aggregation: Fuses multi-modal cues into a unified 3D map for final predictions
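The three stages above can be sketched as a simple data flow. This is a minimal illustrative sketch, not the repository's implementation: the `Track` structure, the stub detections, and the uniform depth rescale are all assumptions standing in for the AV-LLM prompting, segmentation/depth models, and audio fusion that SAVVY actually uses.

```python
from dataclasses import dataclass

@dataclass
class Track:
    label: str
    positions: dict  # frame index -> (x, y, z) position in meters

def snapshot_descriptor(frames):
    # Stage 1 (SD): SAVVY prompts an AV-LLM per snapshot for structured
    # descriptions and initial trajectories; stubbed with one object here.
    return [Track("speaker", {f: (0.5 * f, 0.0, 2.0) for f in frames})]

def refine_tracks(tracks, depth_scale=1.5):
    # Stage 2 (Seg): segmentation (CLIPSeg/SAM2) plus monocular depth
    # (ZoeDepth) would correct each coarse position; simulated here as a
    # uniform depth rescale for illustration.
    for t in tracks:
        t.positions = {f: (x, y, z * depth_scale)
                       for f, (x, y, z) in t.positions.items()}
    return tracks

def build_global_map(tracks):
    # Stage 3: aggregate refined trajectories (and, in SAVVY, spatial-audio
    # direction cues) into one global 3D map used to answer spatial QA.
    return {t.label: t.positions for t in tracks}

global_map = build_global_map(refine_tracks(snapshot_descriptor(range(3))))
```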
Get Started: Visit the SAVVY repository for installation, usage instructions, and algorithm details.
Hugging Face Repository - Benchmark dataset, annotations, and data preprocessing tools
SAVVY-Bench is constructed through a four-stage pipeline combining automated tools with human validation:
- Data Preprocessing: Undistorts videos, aligns multi-stream recordings, processes sensor data
- Annotation: Labels objects and sounding events with 3D spatial annotations
- QA Synthesis: Template-based generation of structured question-answer pairs
- Quality Review: Human verification to ensure annotation and QA accuracy
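As a rough illustration of the QA Synthesis stage, template-based generation can be as simple as slotting annotated fields into a question template. The template text, the `kind` key, and the annotation field names below are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical question templates keyed by question type.
TEMPLATES = {
    "direction": ("From the wearer's viewpoint, in which direction "
                  "is the {obj} when it {event}?"),
}

def synthesize_qa(annotation, kind="direction"):
    # Fill a template from an annotated object/event pair; the ground-truth
    # answer comes directly from the 3D spatial annotation.
    question = TEMPLATES[kind].format(obj=annotation["obj"],
                                      event=annotation["event"])
    return {"question": question, "answer": annotation["answer"]}

qa = synthesize_qa({"obj": "dog", "event": "barks", "answer": "front-left"})
```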
Access Dataset:
- Download from Hugging Face
- Data preprocessing scripts are available in the SAVVY repository under the `data_utils/` folder
Evaluation Repository - We currently support evaluating the following AV-LLMs on SAVVY-Bench:
- VideoLLaMA2
- MiniCPM-o2.6
- Video-SALMONN (video-salmonn-13b)
- Gemini 2.5 Flash / Pro
- Ola
- EgoGPT
- Longvale
Run Benchmarks: Visit the SAVVY-Bench evaluation repository for setup instructions and evaluation scripts.
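At its core, multi-model benchmarking amounts to running each AV-LLM over the QA pairs and scoring its answers. The sketch below is a minimal stand-in: the `evaluate` function, the exact-match scoring rule, and the toy QA pairs are assumptions for illustration, not the evaluation repository's API.

```python
def evaluate(model_fn, qa_pairs):
    # Score one model by exact answer match over all QA pairs.
    correct = sum(model_fn(item["question"]) == item["answer"]
                  for item in qa_pairs)
    return correct / len(qa_pairs)

# Toy QA pairs in the spirit of SAVVY-Bench's spatial questions.
qa_pairs = [
    {"question": "Which side is the barking dog on?", "answer": "left"},
    {"question": "Where is the ringing phone relative to the wearer?",
     "answer": "behind"},
]

# A trivial stand-in "model" that always answers "left".
score = evaluate(lambda q: "left", qa_pairs)
```

A real model adapter would wrap an AV-LLM's inference call behind the same `model_fn` interface, so the scoring loop stays model-agnostic.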
If you use SAVVY or SAVVY-Bench in your research, please cite:
@article{chen2025savvy,
title={SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing},
author={Chen, Mingfei and Cui, Zijun and Liu, Xiulong and Xiang, Jinlin and Zheng, Caleb and Li, Jingyuan and Shlizerman, Eli},
journal={arXiv preprint arXiv:2506.05414},
year={2025}
}
This paper will appear in NeurIPS 2025.

Please refer to individual component licenses:
- SAVVY Algorithm: See LICENSE
- Original Aria Dataset: Project Aria License Agreement
- EgoLifter submodule: Apache License 2.0