- Project website https://sites.google.com/view/larcq
conda create -n larcq python=3.10
conda activate larcq
pip install -r requirements.txt
pip install -e hf-dev-train/transformers-main
pip install -e peft-main
Save the benchmarks in the `datasets` folder.
Due to license restrictions, we cannot open-source our Clotho_LARCQ and SoundDescs_LARCQ benchmarks. However, we provide the code for generating the benchmarks, so you can use it to create any LARCQ-style benchmark you want.
The results in the paper were generated on a machine with NVIDIA GPUs. Make sure `nvidia-smi` is working before running the pipeline.
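If you want a quick sanity check that the GPU is visible from Python, a minimal sketch (assuming PyTorch is installed via `requirements.txt`):

```python
# Quick GPU sanity check (assumes torch was installed from requirements.txt).
import torch

print(torch.cuda.is_available())       # should print True on a correctly configured machine
print(torch.cuda.get_device_name(0))   # prints the detected NVIDIA GPU name
```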
Our pipeline consists of two main parts: multi-modal retrieval and ALM/LLM refining.
Download the `clap-htsat-fused` model from the Hugging Face model link. Save the model in the `models` folder.
Download the `gpt2` model from the Hugging Face model link. Save the model in the `models` folder.
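If you prefer to fetch the checkpoints from Python, a minimal sketch using `huggingface_hub` (the repo ids and local paths below are assumptions; adjust them to the links above):

```python
# Hedged download sketch: repo ids and target folders are assumptions, adapt as needed.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="laion/clap-htsat-fused", local_dir="models/clap-htsat-fused")
snapshot_download(repo_id="gpt2", local_dir="models/gpt2")
```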
The retrieval scripts are in the folder `pipeline/multi_modal_retrieval`. Each script is independent and can be executed directly, so you can evaluate any method on any dataset for comprehensive comparison.
(1) `retrieval_no_chunking.py` retrieves the relevant audios for the queries without any audio chunking or query chunking.
Run the terminal command `python -m pipeline.multi_modal_retrieval.retrieval_no_chunking`.
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_no_chunking.csv`.
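For orientation, a minimal sketch of what no-chunking retrieval looks like with the `transformers` CLAP API (file paths, the query string, and the use of `librosa` for loading are illustrative assumptions, not the script's actual arguments):

```python
# Hedged sketch of no-chunking retrieval: embed the whole audio and the whole query with CLAP,
# then rank audios by cosine similarity. Paths and the query are placeholders.
import torch
import librosa
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("models/clap-htsat-fused").eval()
processor = ClapProcessor.from_pretrained("models/clap-htsat-fused")

audio, _ = librosa.load("datasets/Clotho_LARCQ/audios/example.wav", sr=48000)
query = "A long query describing several consecutive sound events ..."

with torch.no_grad():
    a_emb = model.get_audio_features(**processor(audios=audio, sampling_rate=48000, return_tensors="pt"))
    t_emb = model.get_text_features(**processor(text=query, return_tensors="pt"))

score = torch.nn.functional.cosine_similarity(a_emb, t_emb).item()
print(score)  # higher score means the audio ranks closer to the query
```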
(2) `retrieval_audio_chunking.py` retrieves the relevant audios for the queries with audio chunking max/sum vote only, without any query chunking.
Run the terminal command `python -m pipeline.multi_modal_retrieval.retrieval_audio_chunking`.
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_audio_chunking.csv`.
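One plausible reading of the max/sum vote, sketched below: split the audio into chunks, score each chunk against the query, and aggregate chunk scores either by taking the maximum or by summing (chunk length and aggregation details are assumptions, not necessarily the script's exact settings):

```python
# Illustrative max/sum vote over audio chunks; query chunking works symmetrically over query chunks.
import numpy as np

def chunk_vote(chunk_scores: np.ndarray, vote: str = "max") -> float:
    """Aggregate per-chunk similarities to the query into one score for the whole audio."""
    return float(chunk_scores.max()) if vote == "max" else float(chunk_scores.sum())

scores = np.array([0.12, 0.47, 0.31])  # e.g. CLAP similarity of each audio chunk to the query
print(chunk_vote(scores, "max"), chunk_vote(scores, "sum"))
```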
(3) `retrieval_query_chunking.py` retrieves the relevant audios for the queries with query chunking max/sum vote only, without any audio chunking.
Run the terminal command `python -m pipeline.multi_modal_retrieval.retrieval_query_chunking`.
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_query_chunking.csv`.
(4) `retrieval_audio_chunking_query_chunking.py` applies the four combinations of audio chunking max vote × query chunking sum vote, audio chunking sum vote × query chunking sum vote, audio chunking sum vote × query chunking max vote, and audio chunking max vote × query chunking max vote to retrieve the audios.
Run the terminal command `python -m pipeline.multi_modal_retrieval.retrieval_audio_chunking_query_chunking`.
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_best.csv`.
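One way to picture the four combinations, sketched under assumptions (the aggregation order below is an illustrative reading, not necessarily the script's exact implementation): build a chunk-by-chunk score matrix over audio chunks × query chunks and aggregate each axis with either max or sum.

```python
# Illustrative combination of audio-chunk and query-chunk votes on a small score matrix.
import numpy as np

# S[i, j]: similarity of audio chunk i vs. query chunk j (placeholder values)
S = np.array([[0.1, 0.4, 0.2],
              [0.3, 0.5, 0.1],
              [0.2, 0.2, 0.6]])

audio_max_query_sum = S.max(axis=0).sum()  # max vote over audio chunks, sum vote over query chunks
audio_sum_query_sum = S.sum(axis=0).sum()
audio_sum_query_max = S.sum(axis=0).max()
audio_max_query_max = S.max(axis=0).max()
```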
In our paper, we use two ALMs, GAMA and Audio-Flamingo, to generate captions for the retrieved audios.
(1) Download the `Llama-2-7b-chat-hf-qformer` folder from the Google Drive website link. Save the folder in the `models` folder.
Download the `stage5_epoch2` folder from the Google Drive website link. Unzip it and save the entire folder in the `models` folder.
Run the terminal command `python -m pipeline.alm_llm_refining.run_gama`.
GAMA captioning results on the retrieved audios are saved as `results/alm_results/{benchmark}/retrieved_audios_gama.csv`.
(2) Download the `clapcap_weights_2023.pth` checkpoint from the Hugging Face website link. Save the checkpoint in the `models` folder.
Download the `opt-iml-max-1.3b` folder from the Hugging Face website link. Save the entire folder in the `models` folder.
Download the `foundation.pt` checkpoint from the Hugging Face website link. Save the checkpoint in the `models` folder.
Run the terminal command `python -m pipeline.alm_llm_refining.run_flamingo`.
Audio-Flamingo captioning results on the retrieved audios are saved as `results/alm_results/{benchmark}/retrieved_audios_flamingo.csv`.
In our paper, we use an LLM or miniLM to compare the ALM-generated response with the text query. You can use any LLM or miniLM model you want.
(1) Use LLM
- In our paper, we use Mixtral as the LLM for re-ranking. Follow the tutorial on the Mistral AI website link to set up Mixtral. First, install the `vllm` package (version `>=0.6.1.post1` to ensure maximum compatibility with all Mistral models). Second, authenticate on the Hugging Face Hub using your access token `$HF_TOKEN` through the command `huggingface-cli login --token $HF_TOKEN`.
- Choose an ALM captioning file `results/alm_results/{benchmark}/retrieved_audios_{ALM}.csv`, like `results/alm_results/Clotho_LARCQ/retrieved_audios_gama.csv`.
- Run the terminal command `python -m pipeline.alm_llm_refining.llm_ranking`.
  LLM re-ranking results are saved as `results/llm_results/{benchmark}/{ALM}_llm_ranking.csv`.
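For orientation, a minimal sketch of the kind of prompt-based scoring an LLM re-ranking step can perform with vLLM (the Mixtral repo id, prompt wording, and score parsing below are assumptions; see `pipeline/alm_llm_refining/llm_ranking.py` for the actual logic):

```python
# Hedged LLM re-ranking sketch: ask Mixtral to rate how well each ALM caption matches the query,
# then sort the retrieved audios by that rating. Prompt format and parsing are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")
params = SamplingParams(temperature=0.0, max_tokens=8)

query = "A long complex query describing several sound events ..."
captions = ["Caption of retrieved audio 1 ...", "Caption of retrieved audio 2 ..."]

prompts = [
    f"[INST] Rate from 0 to 10 how well this audio caption matches the query.\n"
    f"Query: {query}\nCaption: {c}\nAnswer with a single number. [/INST]"
    for c in captions
]
outputs = llm.generate(prompts, params)
scores = [float(o.outputs[0].text.strip().split()[0]) for o in outputs]  # naive parse of the rating
ranking = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
```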
(2) Use miniLM
- Download the `ms-marco-MiniLM-L-6-v2` folder from the Hugging Face website link. Save the entire folder in the `models` folder.
- Choose an ALM captioning file `results/alm_results/{benchmark}/retrieved_audios_{ALM}.csv`, like `results/alm_results/Clotho_LARCQ/retrieved_audios_gama.csv`.
- Run the terminal command `python -m pipeline.alm_llm_refining.cross_encoder_ranking`.
  miniLM re-ranking results are saved as `results/llm_results/{benchmark}/{ALM}_cross_encoder_ranking.csv`.
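A minimal sketch of cross-encoder scoring with `sentence-transformers` (the local model path assumes you saved the folder under `models/` as above; the query and captions are placeholders):

```python
# Hedged miniLM re-ranking sketch: score (query, ALM caption) pairs with the cross-encoder
# and sort the retrieved audios by score.
from sentence_transformers import CrossEncoder

model = CrossEncoder("models/ms-marco-MiniLM-L-6-v2")

query = "A long complex query describing several sound events ..."
captions = ["Caption of retrieved audio 1 ...", "Caption of retrieved audio 2 ..."]

scores = model.predict([(query, c) for c in captions])
ranking = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
```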
Finally, we evaluate the following final results to obtain all the R@1 and R@5 metrics in our paper.
- benchmark = Clotho_LARCQ, SoundDescs_LARCQ
- ALM = gama, flamingo
- LLM results: `results/llm_results/{benchmark}/{ALM}_llm_ranking.csv`
- miniLM results: `results/llm_results/{benchmark}/{ALM}_cross_encoder_ranking.csv`
Run the terminal command `python -m evaluate_final_result`.
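For reference, R@k simply checks whether the ground-truth audio appears in the top-k of each re-ranked list; a hedged sketch (the column names `query_id`, `audio_id`, `gt_audio_id`, and `score` are assumptions about the CSV layout):

```python
# Illustrative R@1 / R@5 computation over one re-ranking CSV; column names are assumptions.
import pandas as pd

df = pd.read_csv("results/llm_results/Clotho_LARCQ/gama_llm_ranking.csv")

def recall_at_k(df: pd.DataFrame, k: int) -> float:
    hits = 0
    for _, group in df.groupby("query_id"):
        topk = group.sort_values("score", ascending=False).head(k)
        hits += int((topk["audio_id"] == group["gt_audio_id"].iloc[0]).any())
    return hits / df["query_id"].nunique()

print("R@1:", recall_at_k(df, 1), "R@5:", recall_at_k(df, 5))
```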
We provide the code for generating our Clotho_LARCQ benchmark based on the Clotho Version 2.1 dataset, so you can follow it to create any LARCQ benchmark you want.
(1) Download the `clotho_audio_evaluation.7z` archive and the `clotho_captions_evaluation.csv` file from the Zenodo website link. Save them in the `datasets/Clotho` folder.
(2) Synthesize long-audio-long-query pairs as LARCQ benchmarks.
Run the terminal command `python -m benchmark_generation.synthesize`.
The raw LARCQ captions are saved as `datasets/Clotho_LARCQ/raw_LARCQ_captions.csv`.
The LARCQ audios are saved in `datasets/Clotho_LARCQ/audios/`.
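The synthesis idea, sketched under assumptions: concatenate several short Clotho clips into one long audio and join their captions in order to form the raw long query (clip paths, the number of clips, and the joining phrase are placeholders, not the script's exact settings):

```python
# Hedged synthesis sketch: build one long audio and one raw LARCQ caption from short clips.
import numpy as np
import librosa
import soundfile as sf

clips = ["datasets/Clotho/clip1.wav", "datasets/Clotho/clip2.wav", "datasets/Clotho/clip3.wav"]
captions = ["birds chirping in a park", "a car engine starts", "rain falls on a metal roof"]

audio = np.concatenate([librosa.load(p, sr=44100)[0] for p in clips])
sf.write("datasets/Clotho_LARCQ/audios/long_000.wav", audio, 44100)

raw_caption = " Then ".join(captions)  # raw LARCQ caption, later refined by the LLM steps below
```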
(3) Run LLMs to refine the raw LARCQ captions.
We use two options to refine the raw LARCQ captions into natural long queries.
- Condense the raw captions: run the terminal command `python -m benchmark_generation.llm_condense`.
  The condensed LARCQ captions are saved as `datasets/Clotho_LARCQ/condensed_caption.csv`.
- Rephrase the raw captions: run the terminal command `python -m benchmark_generation.llm_rephrase`.
  The rephrased LARCQ captions are saved as `datasets/Clotho_LARCQ/rephrased_caption.csv`.
(1) Download the original SoundDescs dataset from the official GitHub website link. Save it in the `datasets/SoundDescs` folder.
(2) We filter for audios between 75 and 150 seconds long with captions exceeding 150 characters as complex queries. This results in 1639 audio-query pairs, forming our SoundDescs_LARCQ benchmark.
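A hedged sketch of that filter (the metadata file name, column names, and the way durations are read are assumptions about the downloaded SoundDescs layout):

```python
# Illustrative SoundDescs filtering: keep audios of 75-150 s whose captions exceed 150 characters.
import pandas as pd
import soundfile as sf

meta = pd.read_csv("datasets/SoundDescs/descriptions.csv")  # assumed columns: audio_path, caption

meta["dur"] = meta["audio_path"].map(lambda p: sf.info(p).duration)
larcq = meta[meta["dur"].between(75, 150) & (meta["caption"].str.len() > 150)]
larcq.to_csv("datasets/SoundDescs_LARCQ/pairs.csv", index=False)
```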