Visual Question Answering in PyTorch
- Run `pip install -r requirements.txt` to install all the required Python packages.
- Download the VQA (2.0) dataset from visualqa.org
We assume you use image embeddings, which you can generate with `preprocess_images.py`:

```
python preprocess_images.py <path to instances_train2014.json> \
    --root <path to dataset root "train2014|val2014"> \
    --split <train|val> --arch <vgg16|vgg19_bn|resnet152>
```

I have already pre-processed all the COCO images (both train and test sets) using the VGG-16, VGG-19-BN, and ResNet-152 models. To download them, go into the `image_embeddings` directory and run `make <model>`.
Here `<model>` can be either `vgg16`, `vgg19_bn`, or `resnet152`, depending on which model's embeddings you need, e.g. `make resnet152`.
Alternatively, you can find them here.
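If you are curious what the embedding step boils down to, here is a minimal sketch of extracting a ResNet-152 feature vector for a single COCO image with torchvision. The image path is illustrative, and this is not the repo's actual `preprocess_images.py`, which additionally parses the annotation JSON and batches over the whole split:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Minimal sketch: ResNet-152 with the final classification layer removed,
# so each image yields a 2048-dim feature vector.
backbone = models.resnet152(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Illustrative path; substitute any image from train2014/ or val2014/.
img = Image.open("train2014/COCO_train2014_000000000009.jpg").convert("RGB")
with torch.no_grad():
    feat = feature_extractor(preprocess(img).unsqueeze(0)).flatten(1)  # shape (1, 2048)
```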
To run the training and evaluation code with default values, just type

```
make
```

If you wish to only run the training code, you can run

```
make train
```

If you want to use the raw RGB images from COCO, you can type

```
make raw_images
```

This takes the same arguments as `make train`.
You can get a list of options with `make options` or `python main.py -h`.
Check out the `Makefile` to get an idea of how to run the code.
NOTE The code will take care of all the text preprocessing. Just sit back and relax.
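If you are curious what that preprocessing typically involves, here is a hedged sketch of the usual recipe (build a vocabulary over the training questions, then encode each question as a fixed-length sequence of token ids). The function names are illustrative, not the repo's API:

```python
from collections import Counter

def build_question_vocab(questions, min_count=1):
    """Map every token seen at least min_count times to an integer id."""
    counts = Counter(tok for q in questions for tok in q.lower().rstrip("?").split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, n in counts.items():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode_question(question, vocab, max_len=26):
    """Convert a question string into a fixed-length list of token ids."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in question.lower().rstrip("?").split()]
    return (ids + [vocab["<pad>"]] * max_len)[:max_len]

vocab = build_question_vocab(["What room is this?", "Is the cat black?"])
print(encode_question("what room is this?", vocab))  # padded list of token ids
```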
The minimum arguments required are:
- The VQA train annotations dataset
- The VQA train open-ended questions dataset
- Path to the COCO training image feature embeddings
- The VQA val annotations dataset
- The VQA val open-ended questions dataset
- Path to the COCO val image feature embeddings
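For reference, the annotation and question files above are the standard VQA v2 JSON releases, which pair up by `question_id`. A hedged sketch of how they fit together (field names follow the VQA annotation format; the file paths are illustrative, use whatever you downloaded from visualqa.org):

```python
import json

# Illustrative file names for the train split.
with open("v2_mscoco_train2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]
with open("v2_OpenEnded_mscoco_train2014_questions.json") as f:
    questions = json.load(f)["questions"]

questions_by_id = {q["question_id"]: q for q in questions}
for ann in annotations[:3]:
    q = questions_by_id[ann["question_id"]]
    # Each annotation carries the image id, the consensus answer, and an
    # answer_type such as "yes/no", "number", or "other".
    print(q["image_id"], q["question"], ann["multiple_choice_answer"], ann["answer_type"])
```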
Evaluating the model at a fine-grained level is important, so this repo also reports accuracy broken down by answer type (e.g. "yes/no" questions).
To evaluate the model, run

```
make evaluate
```

You are required to pass in the `--resume` argument to point to the trained model weights. The other arguments are the same as in training.
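Conceptually, the per-answer-type breakdown is just accuracy computed within each `answer_type` bucket. A simplified sketch (using exact-match accuracy rather than the official VQA consensus metric; field names follow the VQA annotations):

```python
from collections import defaultdict

def accuracy_by_answer_type(predictions, annotations):
    """predictions: {question_id: predicted answer string};
    annotations: VQA annotation dicts with 'question_id',
    'multiple_choice_answer', and 'answer_type' fields."""
    correct, total = defaultdict(int), defaultdict(int)
    for ann in annotations:
        hit = predictions.get(ann["question_id"]) == ann["multiple_choice_answer"]
        for key in ("overall", ann["answer_type"]):
            total[key] += 1
            correct[key] += int(hit)
    return {key: 100.0 * correct[key] / total[key] for key in total}
```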
We have a sample demo that you can run with

```
make demo
```

You can also use your own image or question:

```
python demo.py demo_img.jpg "what room is this?"
```

NOTE We train and evaluate on the balanced datasets.
The DeeperLSTM model in this repo achieves the following results:

```
Overall Accuracy is: 49.15
Per Answer Type Accuracy is the following:
other : 38.12
yes/no : 69.55
number : 32.17
```
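For readers unfamiliar with the architecture, here is a hedged sketch of a DeeperLSTM-style VQA model: a multi-layer LSTM question encoder fused with L2-normalized image features by element-wise product, followed by an MLP over the answer vocabulary. The class name and layer sizes are illustrative, not the repo's exact configuration:

```python
import torch
import torch.nn as nn

class DeeperLSTMVQA(nn.Module):
    """Sketch of a DeeperLSTM-style VQA model (illustrative sizes)."""

    def __init__(self, vocab_size, num_answers, img_dim=2048,
                 embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        # The question code concatenates the hidden and cell states of both layers.
        self.q_proj = nn.Linear(2 * 2 * hidden_dim, 1024)
        self.i_proj = nn.Linear(img_dim, 1024)
        self.classifier = nn.Sequential(
            nn.Linear(1024, 1024), nn.Tanh(), nn.Dropout(0.5),
            nn.Linear(1024, num_answers),
        )

    def forward(self, image_feat, question_ids):
        # L2-normalize the precomputed image embedding.
        img = nn.functional.normalize(image_feat, dim=1)
        _, (h, c) = self.lstm(self.embed(question_ids))
        q = torch.cat([h.transpose(0, 1).flatten(1), c.transpose(0, 1).flatten(1)], dim=1)
        # Fuse question and image codes by element-wise product, then classify.
        fused = torch.tanh(self.q_proj(q)) * torch.tanh(self.i_proj(img))
        return self.classifier(fused)
```

With 2048-dim ResNet-152 embeddings, something like `DeeperLSTMVQA(vocab_size=12000, num_answers=3000)` (illustrative sizes) maps a batch of `(image_feat, question_ids)` pairs to answer logits.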