My MSc dissertation project aimed to develop a deep learning algorithm that improves the VQA performance of a state-of-the-art architecture while mimicking human attention, using the VQA-HAT dataset. The project was successful, and some results can be found below.
The dissertation PDF can be found in this repository: msc-dissertation.pdf.
The code is adapted from Stacked Attention Networks for Image Question Answering.
It is written in Python and uses the Theano package. It requires:
- Python 2.7
- Theano
- Numpy
- h5py
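Before training, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, not part of the repository: the helper name `find_missing_modules` is hypothetical, and the import names `theano`, `numpy` and `h5py` are assumed to match the packages listed. It is written to run under Python 2.7 as well as Python 3.

```python
from __future__ import print_function

import importlib


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names assumed from the dependency list above.
    required = ["theano", "numpy", "h5py"]
    missing = find_missing_modules(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only verifies that each package imports; it does not check versions.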
To train a model, run:
cd src/scripts; python mtl_san_deepfix.py
There is another README.md inside src describing the files there.
Some results can be found below. "Human Attention and Answer" is the ground truth. "SAN" is our main baseline, the Stacked Attention Network. Our main algorithm, "MTL SAN+DeepFix", improves the VQA accuracy of the baseline SAN while mimicking human attention. The remaining models are other baselines. Thorough explanations can be found in msc-dissertation.pdf.
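To make the idea of answering correctly while mimicking human attention concrete, here is a generic sketch of a two-term multi-task objective in NumPy. It is an illustration under assumptions, not the loss used in this project: the function name `multitask_loss`, the choice of KL divergence for the attention term, and the `attention_weight` parameter are all hypothetical. The actual formulation is described in msc-dissertation.pdf.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - np.max(x))
    return e / e.sum()


def multitask_loss(answer_logits, answer_index, model_attention,
                   human_attention, attention_weight=1.0, eps=1e-12):
    """Answer cross-entropy plus a weighted attention-matching term.

    Assumption: the attention term is KL(human || model) over flattened,
    normalised attention maps. This is one common choice, not necessarily
    the one used in the dissertation.
    """
    # Cross-entropy of the correct answer under the predicted distribution.
    answer_probs = softmax(answer_logits)
    answer_loss = -np.log(answer_probs[answer_index] + eps)

    # Normalise both attention maps so that each sums to 1.
    p = np.asarray(human_attention, dtype=np.float64).ravel()
    q = np.asarray(model_attention, dtype=np.float64).ravel()
    p = p / p.sum()
    q = q / q.sum()

    # KL divergence between human and model attention.
    attention_loss = np.sum(p * (np.log(p + eps) - np.log(q + eps)))

    return answer_loss + attention_weight * attention_loss
```

With this form, the attention term vanishes when the model's attention map matches the human one, so the total reduces to the ordinary answer loss; `attention_weight` controls how strongly mismatched attention is penalised.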