The task is to predict movie genres from movie trailers (video frames and audio spectrograms), the movie plot (text), the poster (image), and metadata, using the Moviescope dataset. A new multimodal transformer architecture, MulT-GMU, is proposed: it extends the MulT model with dynamic modality fusion via a Gated Multimodal Unit (GMU), sketched below.
This repo contains the code for the paper Multimodal Weighted Fusion of Transformers for Movie Genre Classification (MulT-GMU), published at the NAACL 2021 MAI Workshop.
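The GMU performs the weighted fusion: each modality is projected into a shared hidden space with a tanh transform, and sigmoid gates computed from all modalities weight each one's contribution to the fused representation. The following is a minimal sketch of such a gated fusion layer, not the repo's exact implementation; the class name `GMU`, its arguments, and the feature dimensions are illustrative:

```python
import torch
import torch.nn as nn

class GMU(nn.Module):
    """Minimal Gated Multimodal Unit (Arevalo et al.) for n modalities."""

    def __init__(self, input_dims, hidden_dim):
        super().__init__()
        # Per-modality tanh projections into a shared hidden space.
        self.proj = nn.ModuleList([nn.Linear(d, hidden_dim) for d in input_dims])
        # One gate vector per modality, conditioned on all inputs.
        self.gate = nn.Linear(sum(input_dims), len(input_dims) * hidden_dim)

    def forward(self, xs):
        # xs: list of per-modality features, each of shape (batch, d_i).
        h = [torch.tanh(p(x)) for p, x in zip(self.proj, xs)]
        z = torch.sigmoid(self.gate(torch.cat(xs, dim=-1)))
        z = z.view(z.size(0), len(h), -1)  # (batch, n_modalities, hidden)
        fused = sum(z[:, i] * h[i] for i in range(len(h)))
        return fused, z  # the gates can be inspected (cf. the --output_gates flag below)

# Illustrative usage with video, text, and poster features:
gmu = GMU(input_dims=[4096, 768, 2048], hidden_dim=768)
video, text, poster = torch.randn(4, 4096), torch.randn(4, 768), torch.randn(4, 2048)
fused, gates = gmu([video, text, poster])  # fused: (4, 768)
```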
Example command to run the training script:
>> python mmbt/train.py --batch_sz 4 --gradient_accumulation_steps 32 --savedir /home/user/mmbt_experiments/model_save_mmtr --name moviescope_VideoTextPosterGMU_mmtr_model_run --data_path /home/user --task moviescope --task_type multilabel --model mmtrvpp --num_image_embeds 3 --patience 5 --dropout 0.1 --lr 5e-05 --warmup 0.1 --max_epochs 100 --seed 1 --num_heads 6 --orig_d_v 4096 --output_gates
- MulT: "Multimodal Transformer for Unaligned Multimodal Language Sequences"
- MMBT: "Supervised Multimodal Bitransformers for Classifying Images and Text"
- Moviescope dataset: "Moviescope: Large-scale Analysis of Movies using Multiple Modalities"
- GMU: "Gated Multimodal Units for Information Fusion" by Arevalo et al.
- python 3.7.6
- torch 1.5.1
- tokenizers 0.9.4
- transformers 4.2.2
- Pillow 7.0.0
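For a reproducible environment, the package versions above (Python itself aside) can be pinned in a requirements.txt such as:

```
torch==1.5.1
tokenizers==0.9.4
transformers==4.2.2
Pillow==7.0.0
```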