Question Decomposition for Adaptive Spatio-Temporal Reasoning in Audio-Visual Question Answering


ASTR

This is the official PyTorch implementation of the paper: "Question Decomposition for Adaptive Spatio-Temporal Reasoning in Audio-Visual Question Answering".

Authors: Hyunwoo Kim¹, Intaek Shin¹, and Yukyung Choi¹*

¹Sejong University, Republic of Korea
*Corresponding Author


Abstract

Effective Audio-Visual Question Answering (AVQA) mirrors human cognition, necessitating two complementary stages: precise question understanding and strategic evidence selection. We propose ASTR (Adaptive Spatio-Temporal Reasoning), a framework that explicitly decomposes queries into semantic components (Context/Decisive slots) to guide the reasoning process.

Framework Overview

Leveraging this fine-grained understanding, we introduce:

  1. Query Semantic Decomposition (QSD): Explicitly disentangles global context from decisive grounding targets.
  2. Entropy-Guided Adaptive Sampling (EGAS): Dynamically optimizes the video sampling ratio based on reasoning uncertainty.
  3. Role-Aware Spatial Refinement (RASR): Pinpoints visual regions aligned with specific reasoning needs.
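The entropy-guided idea behind EGAS can be illustrated with a minimal sketch (illustrative only, not the paper's implementation; the `r_min`/`r_max` bounds are hypothetical defaults): the more uncertain the model's intermediate answer distribution, the denser the video sampling ratio.

```python
import math

def sampling_ratio(answer_probs, r_min=0.25, r_max=1.0):
    """Map the entropy of an intermediate answer distribution to a video
    sampling ratio: higher uncertainty -> sample a larger fraction of frames.
    (Sketch only; r_min/r_max are hypothetical, not values from the paper.)"""
    entropy = -sum(p * math.log(p) for p in answer_probs if p > 0.0)
    max_entropy = math.log(len(answer_probs))  # entropy of a uniform distribution
    uncertainty = entropy / max_entropy if max_entropy > 0 else 0.0
    return r_min + (r_max - r_min) * uncertainty

# A confident (one-hot) distribution maps to r_min; a uniform one maps to r_max.
```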

1. Installation

Please refer to the requirements.txt file for the detailed environment.

  • python == 3.10
  • pytorch == 2.1.2
  • numpy == 1.24.4
```shell
pip install -r requirements.txt
```

2. Data Preparation

2.1 Download videos

Please download the datasets from their official repositories:

  • MUSIC-AVQA: Link
  • MUSIC-AVQA-v2.0: Link

2.2 Feature Extraction

  • We extracted features following the official repository of TSPM.
  • Please organize the features as follows (or change the directory names in ./configs/arguments.py):
```
data_path
┣ audio_features
┃ ┗ vggish
┣ visual_features
┃ ┣ frame_level_l14
┃ ┗ token_level_l14_8x8 <-- (Note: 2D average pooled to 8x8 resolution)
┗ text_features
  ┣ sub-qst-feat-l14-word-77
  ┗ replaced-qst-feat-val-word-77
```
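A quick sanity check that the feature directories above exist under your data_path (a convenience sketch; the sub-directory names follow the tree above):

```python
import os

EXPECTED_DIRS = [
    "audio_features/vggish",
    "visual_features/frame_level_l14",
    "visual_features/token_level_l14_8x8",
    "text_features/sub-qst-feat-l14-word-77",
    "text_features/replaced-qst-feat-val-word-77",
]

def check_feature_layout(data_path):
    """Return the expected feature sub-directories that are missing under data_path."""
    return [d for d in EXPECTED_DIRS
            if not os.path.isdir(os.path.join(data_path, d))]

missing = check_feature_layout("data_path")
if missing:
    print("Missing feature directories:", missing)
```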

Note:

  • For token_level_l14_8x8, we performed 2D average pooling to 8x8 resolution during patch feature extraction.
  • replaced-qst-feat corresponds to the word-level features of the replaced sentences provided in ./dataset/split_que_id/music_avqa_with_replace.json.
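The 8x8 pooling mentioned above can be reproduced roughly as follows (a numpy sketch, assuming a square row-major patch grid such as the 16x16 grid from ViT-L/14 at 224px input; the actual extraction follows the TSPM repository and may differ in details):

```python
import numpy as np

def pool_patch_tokens(tokens, out_hw=8):
    """2D average-pool a square grid of patch tokens down to out_hw x out_hw.
    tokens: (H*W, D) array laid out row-major with H == W.
    (Illustrative sketch only.)"""
    n, d = tokens.shape
    hw = int(round(n ** 0.5))
    assert hw * hw == n and hw % out_hw == 0
    k = hw // out_hw                                # pooling kernel size
    grid = tokens.reshape(out_hw, k, out_hw, k, d)  # split rows/cols into k x k blocks
    return grid.mean(axis=(1, 3)).reshape(out_hw * out_hw, d)

pooled = pool_patch_tokens(np.random.rand(256, 1024))  # 16x16 grid -> 8x8 grid
```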

3. Train

  • To train ASTR on the MUSIC-AVQA dataset with the optimal settings reported in the paper, run the command below.
    • You can set the name of the experiment with the --checkpoint parameter.
  • We will provide the train/test code for MUSIC-AVQA-R and MUSIC-AVQA-v2.0 as soon as possible.
```shell
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -u main_train.py --checkpoint best
```

4. Test

  • We provide our best checkpoint file: Download Link.
    • Download the checkpoint and place it at ./models/best.pt.
    • Run the evaluation command:
```shell
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -u main_test.py --batch-size 1 --num_workers 2 --checkpoint best --result_dir results/best/
```

Acknowledgement

Our code is built upon TSPM. We thank the authors for their open-source contribution.
