Question Decomposition for Adaptive Spatio-Temporal Reasoning in Audio-Visual Question Answering


ASTR

This is the official PyTorch implementation of the paper: "Question Decomposition for Adaptive Spatio-Temporal Reasoning in Audio-Visual Question Answering".

Authors: Hyunwoo Kim¹, Intaek Shin¹, and Yukyung Choi¹*

¹Sejong University, Republic of Korea
*Corresponding Author


Abstract

Effective Audio-Visual Question Answering (AVQA) mirrors human cognition, necessitating two complementary stages: precise question understanding and strategic evidence selection. We propose ASTR (Adaptive Spatio-Temporal Reasoning), a framework that explicitly decomposes queries into semantic components (Context/Decisive slots) to guide the reasoning process.

Framework Overview

Leveraging this fine-grained understanding, we introduce:

  1. Query Semantic Decomposition (QSD): Explicitly disentangles global context from decisive grounding targets.
  2. Entropy-Guided Adaptive Sampling (EGAS): Dynamically optimizes the video sampling ratio based on reasoning uncertainty.
  3. Role-Aware Spatial Refinement (RASR): Pinpoints visual regions aligned with specific reasoning needs.
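The entropy-guided idea behind EGAS can be illustrated with a minimal sketch (illustrative only, not the paper's implementation; the `r_min`/`r_max` bounds are hypothetical defaults): the more uncertain the model's intermediate answer distribution, the denser the video sampling ratio.

```python
import math

def sampling_ratio(answer_probs, r_min=0.25, r_max=1.0):
    """Map the entropy of an intermediate answer distribution to a video
    sampling ratio: higher uncertainty -> sample a larger fraction of frames.
    (Sketch only; r_min/r_max are hypothetical, not values from the paper.)"""
    entropy = -sum(p * math.log(p) for p in answer_probs if p > 0.0)
    max_entropy = math.log(len(answer_probs))  # entropy of a uniform distribution
    uncertainty = entropy / max_entropy if max_entropy > 0 else 0.0
    return r_min + (r_max - r_min) * uncertainty

# A confident (one-hot) distribution maps to r_min; a uniform one maps to r_max.
```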

1. Installation

Please refer to the requirements.txt file for the detailed environment.

  • python == 3.10
  • pytorch == 2.1.2
  • numpy == 1.24.4
```shell
pip install -r requirements.txt
```

2. Data Preparation

2.1 Download videos

Please download the datasets from their official repositories:

  • MUSIC-AVQA: Link
  • MUSIC-AVQA-v2.0: Link

2.2 Feature Extraction

  • We extracted features following the official repository of TSPM.
  • Please organize the features as follows (or change the directory names in ./configs/arguments.py):
```
data_path
┣ audio_features
┃ ┗ vggish
┣ visual_features
┃ ┣ frame_level_l14
┃ ┗ token_level_l14_8x8 <-- (Note: 2D average pooled to 8x8 resolution)
┗ text_features
  ┣ sub-qst-feat-l14-word-77
  ┗ replaced-qst-feat-val-word-77
```
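A quick sanity check that the feature directories above exist under your data_path (a convenience sketch; the sub-directory names follow the tree above):

```python
import os

EXPECTED_DIRS = [
    "audio_features/vggish",
    "visual_features/frame_level_l14",
    "visual_features/token_level_l14_8x8",
    "text_features/sub-qst-feat-l14-word-77",
    "text_features/replaced-qst-feat-val-word-77",
]

def check_feature_layout(data_path):
    """Return the expected feature sub-directories that are missing under data_path."""
    return [d for d in EXPECTED_DIRS
            if not os.path.isdir(os.path.join(data_path, d))]

missing = check_feature_layout("data_path")
if missing:
    print("Missing feature directories:", missing)
```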

Note:

  • For token_level_l14_8x8, we performed 2D average pooling to 8x8 resolution during patch feature extraction.
  • replaced-qst-feat corresponds to the word-level features of the replaced sentences provided in ./dataset/split_que_id/music_avqa_with_replace.json.
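The 8x8 pooling mentioned above can be reproduced roughly as follows (a numpy sketch, assuming a square row-major patch grid such as the 16x16 grid from ViT-L/14 at 224px input; the actual extraction follows the TSPM repository and may differ in details):

```python
import numpy as np

def pool_patch_tokens(tokens, out_hw=8):
    """2D average-pool a square grid of patch tokens down to out_hw x out_hw.
    tokens: (H*W, D) array laid out row-major with H == W.
    (Illustrative sketch only.)"""
    n, d = tokens.shape
    hw = int(round(n ** 0.5))
    assert hw * hw == n and hw % out_hw == 0
    k = hw // out_hw                                # pooling kernel size
    grid = tokens.reshape(out_hw, k, out_hw, k, d)  # split rows/cols into k x k blocks
    return grid.mean(axis=(1, 3)).reshape(out_hw * out_hw, d)

pooled = pool_patch_tokens(np.random.rand(256, 1024))  # 16x16 grid -> 8x8 grid
```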

3. Train

  • To train ASTR on the MUSIC-AVQA dataset with the optimal settings reported in the paper, run the command below.
    • You can set the name of the experiment with the --checkpoint parameter.
  • We will provide the train/test code for MUSIC-AVQA-R and MUSIC-AVQA-v2.0 as soon as possible.
```shell
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -u main_train.py --checkpoint best
```

4. Test

  • We provide our best checkpoint file: Download Link.
    • Download the checkpoint and place it at ./models/best.pt.
    • Run the evaluation command:
```shell
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -u main_test.py --batch-size 1 --num_workers 2 --checkpoint best --result_dir results/best/
```

Acknowledgement

Our code is built upon TSPM. We thank the authors for their open-source contribution.
