This is the official PyTorch implementation of the paper: "Question Decomposition for Adaptive Spatio-Temporal Reasoning in Audio-Visual Question Answering".
Authors: Hyunwoo Kim¹, Intaek Shin¹, and Yukyung Choi¹*
¹Sejong University, Republic of Korea
*Corresponding Author
Effective Audio-Visual Question Answering (AVQA) mirrors human cognition, necessitating two complementary stages: precise question understanding and strategic evidence selection. We propose ASTR (Adaptive Spatio-Temporal Reasoning), a framework that explicitly decomposes queries into semantic components (Context/Decisive slots) to guide the reasoning process.
Leveraging this fine-grained understanding, we introduce:
- Query Semantic Decomposition (QSD): Explicitly disentangles global context from decisive grounding targets.
- Entropy-Guided Adaptive Sampling (EGAS): Dynamically optimizes the video sampling ratio based on reasoning uncertainty.
- Role-Aware Spatial Refinement (RASR): Pinpoints visual regions aligned with specific reasoning needs.
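To make the EGAS idea above concrete, here is a minimal sketch of mapping reasoning uncertainty (entropy of an answer distribution) to a video sampling ratio. This is an illustrative assumption, not the paper's exact formulation: the `min_ratio`/`max_ratio` bounds and the linear mapping are placeholders.

```python
import numpy as np

def sampling_ratio(answer_logits, min_ratio=0.25, max_ratio=1.0):
    """Map prediction uncertainty (normalized entropy) to a sampling ratio.

    Illustrative sketch only: the actual EGAS module may use a different
    uncertainty measure and mapping than this linear interpolation.
    """
    p = np.exp(answer_logits - answer_logits.max())
    p /= p.sum()                          # softmax over candidate answers
    entropy = -(p * np.log(p + 1e-12)).sum()
    max_entropy = np.log(len(p))          # entropy of a uniform distribution
    u = entropy / max_entropy             # normalized uncertainty in [0, 1]
    return min_ratio + (max_ratio - min_ratio) * u

# Confident prediction -> sample fewer frames; uncertain -> sample more.
confident = sampling_ratio(np.array([8.0, 0.1, 0.1, 0.1]))
uncertain = sampling_ratio(np.array([1.0, 1.0, 1.0, 1.0]))
```

A near-one-hot answer distribution yields a ratio near `min_ratio`, while a uniform (maximally uncertain) distribution yields `max_ratio`, so harder questions get denser temporal sampling.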
Please refer to the `requirements.txt` file for the detailed environment.
- python == 3.10
- pytorch == 2.1.2
- numpy == 1.24.4
```shell
pip install -r requirements.txt
```

Please download the datasets from their official repositories:
- We extracted features following the official repository of TSPM.
- Please organize the features as follows (or change the directory names in `./configs/arguments.py`):
```
data_path
┣ audio_features
┃ ┗ vggish
┣ visual_features
┃ ┣ frame_level_l14
┃ ┗ token_level_l14_8x8   <-- (Note: 2D average pooled to 8x8 resolution)
┗ text_features
  ┣ sub-qst-feat-l14-word-77
  ┗ replaced-qst-feat-val-word-77
```
Note:
- For `token_level_l14_8x8`, we performed 2D average pooling to an 8x8 resolution during patch feature extraction.
- `replaced-qst-feat` corresponds to the word-level features of the replaced sentences provided in `./dataset/split_que_id/music_avqa_with_replace.json`.
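The 2D average pooling described above can be sketched as follows. The 16x16 input patch grid and the feature dimension are assumptions for illustration (they depend on the backbone and input resolution), not values confirmed by this repository.

```python
import numpy as np

def avg_pool_tokens(tokens, out_size=8):
    """2D average pooling of a patch-feature grid to out_size x out_size.

    tokens: array of shape (H, W, D); H and W must be divisible by out_size.
    """
    h, w, d = tokens.shape
    fh, fw = h // out_size, w // out_size
    # Group patches into (out_size x out_size) cells and average each cell.
    return tokens.reshape(out_size, fh, out_size, fw, d).mean(axis=(1, 3))

patch_feats = np.random.rand(16, 16, 1024)  # dummy patch grid (assumed sizes)
pooled = avg_pool_tokens(patch_feats)       # -> shape (8, 8, 1024)
```

Each pooled token is the mean of a non-overlapping 2x2 block of patch features, reducing the token count by 4x while keeping a coarse spatial layout.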
- To train ASTR on the MUSIC-AVQA dataset with the optimal settings reported in the paper:

```shell
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -u main_train.py --checkpoint best
```

- You can set the name of the experiment with the parameter `--checkpoint`.
- We'll provide the train/test code for MUSIC-AVQA-R and MUSIC-AVQA-v2.0 as soon as possible.
- We provide our best checkpoint file: Download Link.
- Download the checkpoint and place it in `./models/best.pt`.
- Run the evaluation command:

```shell
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -u main_test.py --batch-size 1 --num_workers 2 --checkpoint best --result_dir results/best/
```
Our code is built upon TSPM. We thank the authors for their open-source contribution.
