
SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Overview

3D spatial reasoning in dynamic audio-visual environments remains largely unexplored by current Audio-Visual LLMs (AV-LLMs). SAVVY introduces a training-free reasoning pipeline that enhances AV-LLMs by recovering object trajectories and constructing unified global 3D maps for spatial question answering. Alongside the algorithm, we present SAVVY-Bench, the first benchmark for evaluating dynamic 3D spatial reasoning in audio-visual environments.

⭐ If this project helps your research, a star is appreciated!

Project Components

This project consists of three main components, each hosted in a separate repository:

Quick Links

| Component | Folder | Description |
| --- | --- | --- |
| SAVVY Algorithm | SAVVY | Training-free spatial reasoning pipeline; data preprocessing tools, benchmark data, annotations |
| SAVVY-Bench Dataset | HuggingFace | Preview QA pairs in the Hugging Face Data Studio |
| Evaluation Code | SAVVY-Bench | Multi-model benchmarking framework |

Main Algorithm Repository - Core SAVVY pipeline implementation

The SAVVY algorithm employs a multi-stage reasoning approach:

  • Snapshot Descriptor (SD): Prompts AV-LLMs to extract structured descriptions and initial object trajectories
  • Text-Guided Segmentation (Seg): Refines trajectories using visual foundation models (CLIPSeg, SAM2, ZoeDepth)
  • Spatial Audio & Track Aggregation: Fuses multi-modal cues into a unified 3D map for final predictions
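The three stages above can be sketched as a simple orchestration skeleton. This is a minimal illustrative sketch, not the actual implementation: the function names, data shapes, and stub outputs are all hypothetical stand-ins for the AV-LLM prompting, segmentation models, and audio localization described above.

```python
# Hypothetical sketch of the SAVVY stage flow; names and data shapes are
# illustrative, not the real pipeline's API.
from dataclasses import dataclass, field


@dataclass
class Track:
    """One object's trajectory: a name and a list of (x, y, z) positions."""
    name: str
    positions: list = field(default_factory=list)


def snapshot_descriptor(frames):
    """Stage 1 (SD): stand-in for prompting an AV-LLM to extract structured
    descriptions and coarse initial trajectories from video snapshots."""
    return [Track("speaker", [(0.0, 0.0, 1.0)]),
            Track("door", [(2.0, 0.0, 3.0)])]


def refine_with_segmentation(tracks):
    """Stage 2 (Seg): stand-in for refining trajectories with visual
    foundation models (CLIPSeg, SAM2, ZoeDepth in the paper)."""
    for t in tracks:
        t.positions = [(x, y, round(z, 1)) for x, y, z in t.positions]
    return tracks


def aggregate(tracks, audio_cues):
    """Stage 3: fuse visual tracks and spatial-audio cues into one global
    3D map (object name -> latest estimated position)."""
    scene = {t.name: t.positions[-1] for t in tracks}
    scene.update(audio_cues)  # audio can localize objects never seen on camera
    return scene


def answer_spatial_question(frames, audio_cues, question):
    """Run all three stages, then answer a toy distance query (a, b)."""
    tracks = refine_with_segmentation(snapshot_descriptor(frames))
    scene = aggregate(tracks, audio_cues)
    a, b = question
    ax, ay, az = scene[a]
    bx, by, bz = scene[b]
    return ((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2) ** 0.5
```

With the stub positions above, `answer_spatial_question([], {}, ("speaker", "door"))` returns the Euclidean distance between the two tracked objects; real queries would of course run over trajectories recovered from the video and audio streams.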

Get Started: Visit the SAVVY repository for installation, usage instructions, and algorithm details.


Hugging Face Repository - Benchmark dataset, annotations, and data preprocessing tools

SAVVY-Bench is constructed through a four-stage pipeline combining automated tools with human validation:

  1. Data Preprocessing: Undistorts videos, aligns multi-stream recordings, and processes sensor data
  2. Annotation: Labels objects and sounding events with 3D spatial annotations
  3. QA Synthesis: Generates structured question-answer pairs from templates
  4. Quality Review: Human verification ensures annotation precision
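The template-based QA synthesis step (stage 3) can be illustrated with a small sketch. The template string and annotation schema here are hypothetical, assumed only for illustration; the actual benchmark uses richer templates over the 3D annotations produced in stage 2.

```python
# Illustrative template-based QA synthesis; templates and the annotation
# schema (object -> timestamp -> (x, y, z)) are hypothetical.
import math

TEMPLATES = {
    "distance": "How far is the {a} from the {b} at t={t}s?",
}


def synthesize_qa(annotations, t, a, b):
    """Instantiate a distance-question template from 3D annotations at time t."""
    pa, pb = annotations[a][t], annotations[b][t]
    question = TEMPLATES["distance"].format(a=a, b=b, t=t)
    answer = round(math.dist(pa, pb), 2)  # ground truth from the 3D positions
    return {"question": question, "answer": answer}


annotations = {
    "speaker": {5: (0.0, 0.0, 1.0)},
    "door": {5: (2.0, 0.0, 3.0)},
}
qa = synthesize_qa(annotations, 5, "speaker", "door")
```

Because answers are computed directly from the 3D annotations, generated QA pairs are internally consistent by construction; the human review stage then catches annotation errors rather than arithmetic ones.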

Access Dataset:


Evaluation Repository - Multi-model framework for evaluating AV-LLMs on SAVVY-Bench

Run Benchmarks: Visit the SAVVY-Bench evaluation repository for setup instructions and evaluation scripts.

Citation

If you use SAVVY or SAVVY-Bench in your research, please cite:

@article{chen2025savvy,
  title={SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing},
  author={Chen, Mingfei and Cui, Zijun and Liu, Xiulong and Xiang, Jinlin and Zheng, Caleb and Li, Jingyuan and Shlizerman, Eli},
  journal={arXiv preprint arXiv:2506.05414},
  year={2025}
}

This paper will appear in NeurIPS 2025.

License

Please refer to the license of each individual component repository.