TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention

Authors

  • Dipali Koshti, Sir Padampat Singhania University, India
  • Ashutosh Gupta, Sir Padampat Singhania University, India
  • Mukesh Kalla, Sir Padampat Singhania University, India
  • Arvind Sharma, Sir Padampat Singhania University, India

DOI:

https://doi.org/10.4114/intartif.vol27iss73pp111-128

Keywords:

Visual question answering, Transformer-based VQA, BERT-based VQA, Question-guided VQA

Abstract

Understanding multiple modalities and relating them is an easy task for humans, but for machines it is a challenging one. One such multi-modal reasoning task is Visual Question Answering (VQA), which requires a machine to answer a natural-language question about a given image. Although plenty of work has been done in this field, improving a model's answer-prediction ability and reaching human-level accuracy remain open challenges. A novel transformer-based model for answering image-based questions is proposed. The proposed model is a fully transformer-based architecture that uses a transformer both to extract language features and to perform joint understanding of question and image features. The model uses F-RCNN for image feature extraction. The extracted language features and object-level image features are fed to a decoder inspired by the Bidirectional Encoder Representations from Transformers (BERT) architecture, which jointly learns image characteristics guided by the question characteristics and yields rich representations of the image features. Extensive experimentation has been carried out to observe the effect of various hyperparameters on the model's performance. The experimental results demonstrate that the model's answer-prediction ability increases with the number of layers in the transformer's encoder and decoder. The proposed model improves upon previous models and is highly scalable due to the introduction of BERT. Our best model reports 72.31% accuracy on the test-standard split of the VQAv2 dataset.
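
As a rough illustration of the question-guided vision attention described in the abstract, the following is a minimal PyTorch sketch of one decoder layer in which object-level image features (e.g., Faster R-CNN-style region features) attend to BERT question embeddings. The layer structure, dimensions, and names below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a question-guided vision-attention decoder layer.
# Assumes pre-extracted object features and BERT question embeddings,
# both projected to the same hidden size. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class QuestionGuidedDecoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Self-attention over the image object features
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Cross-attention: image features (queries) attend to question tokens (keys/values),
        # i.e., the image representation is guided by the question
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d_model), nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, img_feats, q_feats):
        # img_feats: (batch, num_objects, d_model) projected region features
        # q_feats:   (batch, num_tokens,  d_model) BERT question embeddings
        x = self.norm1(img_feats + self.drop(self.self_attn(img_feats, img_feats, img_feats)[0]))
        x = self.norm2(x + self.drop(self.cross_attn(x, q_feats, q_feats)[0]))
        x = self.norm3(x + self.drop(self.ffn(x)))
        return x

# Usage: stack several layers, pool the output, and classify over the answer vocabulary.
layers = nn.ModuleList(QuestionGuidedDecoderLayer() for _ in range(6))
img = torch.randn(2, 36, 768)  # 36 object regions per image (a common Faster R-CNN setting)
q = torch.randn(2, 14, 768)    # 14 question tokens
for layer in layers:
    img = layer(img, q)
print(img.shape)  # torch.Size([2, 36, 768])
```

Stacking more such layers corresponds to the deeper encoder/decoder configurations whose effect on accuracy the abstract reports.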

Published

2024-01-10

How to Cite

Koshti, D., Gupta, A., Kalla, M., & Sharma, A. (2024). TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention. Inteligencia Artificial, 27(73), 111–128. https://doi.org/10.4114/intartif.vol27iss73pp111-128