A Region Descriptive Pre-training Approach with Self-attention Towards Visual Question Answering

Published: 01 Jan 2021, Last Modified: 06 Jun 2025 · ICONIP (6) 2021 · CC BY-SA 4.0
Abstract: Concatenating text (question-answer) and image inputs has been the bedrock of most visual-language systems, yet existing models perform this concatenation in a forced manner. In this paper, we introduce a region descriptive pre-training approach with self-attention for visual question answering (VQA). Our model is a new learning method that combines image region descriptions with object labels to create a proper alignment between the text (question-answer) and image inputs. Studying the text associated with each image, we find that extracting region descriptions from the image and using them during training greatly improves the model's performance. In this work, the region descriptions extracted from the images serve as a bridge that maps the text inputs to the image inputs. With this addition, our model performs better than several recent state-of-the-art models; the experiments presented in this paper show that it significantly outperforms most of them.
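The alignment idea in the abstract can be sketched as building a single input sequence in which region descriptions and object labels sit between the question tokens and the image regions, so self-attention can attend across both modalities. This is a minimal, hypothetical sketch: the function name, the special tokens, and the way labels are interleaved are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch (assumed names/tokens): bridge the question and the
# image regions with region descriptions paired to object labels, so a
# self-attention encoder can align words with regions.
def build_vqa_input(question, region_descriptions, object_labels):
    # Text stream: the question tokens.
    tokens = ["[CLS]"] + question.lower().split() + ["[SEP]"]
    # Bridge stream: each region description followed by its object label,
    # acting as the mapping between text and image inputs.
    for desc, label in zip(region_descriptions, object_labels):
        tokens += desc.lower().split() + [f"[{label}]"]
    tokens.append("[SEP]")
    return tokens

seq = build_vqa_input(
    "what is on the table",
    ["a red apple", "a wooden table"],
    ["apple", "table"],
)
```

In an actual model, each region's visual features would be appended after this bridged sequence; the shared description/label tokens give the transformer an explicit anchor between question words and image regions, rather than forcing the two modalities together by plain concatenation.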