
SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Overview

3D spatial reasoning in dynamic audio-visual environments remains largely unexplored by current Audio-Visual LLMs (AV-LLMs). SAVVY introduces a training-free reasoning pipeline that enhances AV-LLMs by recovering object trajectories and constructing unified global 3D maps for spatial question answering. Alongside the algorithm, we present SAVVY-Bench, the first benchmark for evaluating dynamic 3D spatial reasoning in audio-visual environments.

⭐ If this project helps your research, a star is appreciated!

Project Components

This project consists of three main components, each hosted in a separate repository:

Quick Links

| Component | Folder | Description |
| --- | --- | --- |
| SAVVY Algorithm | SAVVY | Training-free spatial reasoning pipeline; data preprocessing tools, benchmark data, annotations |
| SAVVY-Bench Dataset | HuggingFace | Preview QA pairs in the Hugging Face Data Studio |
| Evaluation Code | SAVVY-Bench | Multi-model benchmarking framework |

Main Algorithm Repository - Core SAVVY pipeline implementation

The SAVVY algorithm employs a multi-stage reasoning approach:

  • Snapshot Descriptor (SD): Prompts AV-LLMs to extract structured descriptions and initial object trajectories
  • Text-Guided Segmentation (Seg): Refines trajectories using visual foundation models (CLIPSeg, SAM2, ZoeDepth)
  • Spatial Audio & Track Aggregation: Fuses multi-modal cues into a unified 3D map for final predictions
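The three stages above can be sketched as a simple orchestration skeleton. This is a minimal illustrative sketch, not the actual implementation: the function names, data shapes, and stub outputs are all hypothetical stand-ins for the AV-LLM prompting, segmentation models, and audio localization described above.

```python
# Hypothetical sketch of the SAVVY stage flow; names and data shapes are
# illustrative, not the real pipeline's API.
from dataclasses import dataclass, field


@dataclass
class Track:
    """One object's trajectory: a name and a list of (x, y, z) positions."""
    name: str
    positions: list = field(default_factory=list)


def snapshot_descriptor(frames):
    """Stage 1 (SD): stand-in for prompting an AV-LLM to extract structured
    descriptions and coarse initial trajectories from video snapshots."""
    return [Track("speaker", [(0.0, 0.0, 1.0)]),
            Track("door", [(2.0, 0.0, 3.0)])]


def refine_with_segmentation(tracks):
    """Stage 2 (Seg): stand-in for refining trajectories with visual
    foundation models (CLIPSeg, SAM2, ZoeDepth in the paper)."""
    for t in tracks:
        t.positions = [(x, y, round(z, 1)) for x, y, z in t.positions]
    return tracks


def aggregate(tracks, audio_cues):
    """Stage 3: fuse visual tracks and spatial-audio cues into one global
    3D map (object name -> latest estimated position)."""
    scene = {t.name: t.positions[-1] for t in tracks}
    scene.update(audio_cues)  # audio can localize objects never seen on camera
    return scene


def answer_spatial_question(frames, audio_cues, question):
    """Run all three stages, then answer a toy distance query (a, b)."""
    tracks = refine_with_segmentation(snapshot_descriptor(frames))
    scene = aggregate(tracks, audio_cues)
    a, b = question
    ax, ay, az = scene[a]
    bx, by, bz = scene[b]
    return ((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2) ** 0.5
```

With the stub positions above, `answer_spatial_question([], {}, ("speaker", "door"))` returns the Euclidean distance between the two tracked objects; real queries would of course run over trajectories recovered from the video and audio streams.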

Get Started: Visit the SAVVY repository for installation, usage instructions, and algorithm details.


Hugging Face Repository - Benchmark dataset, annotations, and data preprocessing tools

SAVVY-Bench is constructed through a four-stage pipeline combining automated tools with human validation:

  1. Data Preprocessing: Undistorts videos, aligns multi-stream recordings, and processes sensor data
  2. Annotation: Labels objects and sounding events with 3D spatial annotations
  3. QA Synthesis: Generates structured question-answer pairs from templates
  4. Quality Review: Human verification ensures annotation precision
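The template-based QA synthesis step (stage 3) can be illustrated with a small sketch. The template string and annotation schema here are hypothetical, assumed only for illustration; the actual benchmark uses richer templates over the 3D annotations produced in stage 2.

```python
# Illustrative template-based QA synthesis; templates and the annotation
# schema (object -> timestamp -> (x, y, z)) are hypothetical.
import math

TEMPLATES = {
    "distance": "How far is the {a} from the {b} at t={t}s?",
}


def synthesize_qa(annotations, t, a, b):
    """Instantiate a distance-question template from 3D annotations at time t."""
    pa, pb = annotations[a][t], annotations[b][t]
    question = TEMPLATES["distance"].format(a=a, b=b, t=t)
    answer = round(math.dist(pa, pb), 2)  # ground truth from the 3D positions
    return {"question": question, "answer": answer}


annotations = {
    "speaker": {5: (0.0, 0.0, 1.0)},
    "door": {5: (2.0, 0.0, 3.0)},
}
qa = synthesize_qa(annotations, 5, "speaker", "door")
```

Because answers are computed directly from the 3D annotations, generated QA pairs are internally consistent by construction; the human review stage then catches annotation errors rather than arithmetic ones.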

Access Dataset:


Evaluation Repository - Multi-model framework for evaluating AV-LLMs on SAVVY-Bench

Run Benchmarks: Visit the SAVVY-Bench evaluation repository for setup instructions and evaluation scripts.

Citation

If you use SAVVY or SAVVY-Bench in your research, please cite:

@article{chen2025savvy,
  title={SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing},
  author={Chen, Mingfei and Cui, Zijun and Liu, Xiulong and Xiang, Jinlin and Zheng, Caleb and Li, Jingyuan and Shlizerman, Eli},
  journal={arXiv preprint arXiv:2506.05414},
  year={2025}
}

This paper will appear in NeurIPS 2025.

License

Please refer to the license of each individual component repository.