So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · IEEE Trans. Syst. Man Cybern. Syst. 2024 · CC BY-SA 4.0
Abstract: Because text appearing in images conveys essential information for scene understanding and reasoning, text-based visual question answering tasks concentrate on visual questions that require reading text from images. However, most current methods feed multimodal features, independently extracted from a given image, into a reasoning model without considering their intra- and inter-modality relationships across the three modalities (i.e., scene texts, questions, and images). To this end, we propose a novel text-based visual question answering model, multimodal graph reasoning. Our model first captures intramodality relationships by treating the representations from the same modality as a semantic graph. Then, we present graph multihead self-attention, which boosts each graph representation through graph-by-graph aggregation to capture intermodality relationships. It is a case of “so many heads, so many wits” in the sense that as more semantic graphs are involved in this process, each graph representation becomes more effective. Finally, these representations are reprojected, and answer prediction is performed on the reprojected outputs. The experimental results demonstrate that our approach achieves substantially better performance than other state-of-the-art models.
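To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of the described flow: per-modality (intramodality) encoding, a graph multihead self-attention step that lets the three semantic graphs attend to one another, reprojection, and answer prediction. All module names, dimensions, the use of a fully connected self-attention layer in place of the paper's semantic-graph construction, and the mean-pool fusion are assumptions for illustration, not the authors' exact architecture.

```python
# Hypothetical sketch of the multimodal graph reasoning pipeline (assumed design).
import torch
import torch.nn as nn


class GraphMultiHeadSelfAttention(nn.Module):
    """Aggregates node features graph-by-graph across modalities (assumed form)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, graphs: list[torch.Tensor]) -> list[torch.Tensor]:
        # Concatenate node features from every semantic graph so each graph can
        # attend to the others ("so many heads, so many wits" aggregation).
        joint = torch.cat(graphs, dim=1)             # (B, total_nodes, D)
        refined, _ = self.attn(joint, joint, joint)  # intermodality attention
        # Split back into per-graph representations.
        sizes = [g.size(1) for g in graphs]
        return list(torch.split(refined, sizes, dim=1))


class MultimodalGraphReasoning(nn.Module):
    """Toy end-to-end model: intra-modality graphs -> inter-modality attention -> answer."""

    def __init__(self, dim: int = 512, num_answers: int = 5000):
        super().__init__()
        # Intramodality encoders, one per modality (scene text, question, image);
        # a self-attention layer stands in for the paper's semantic-graph module.
        self.intra = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(3)
        )
        self.inter = GraphMultiHeadSelfAttention(dim)
        self.reproject = nn.ModuleList(nn.Linear(dim, dim) for _ in range(3))
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, ocr_feats, question_feats, image_feats):
        graphs = [enc(x) for enc, x in zip(self.intra, (ocr_feats, question_feats, image_feats))]
        graphs = self.inter(graphs)  # intermodality reasoning
        # Reproject each graph, pool its nodes, and fuse by summation (assumed fusion).
        pooled = torch.stack(
            [proj(g).mean(dim=1) for proj, g in zip(self.reproject, graphs)]
        ).sum(dim=0)
        return self.classifier(pooled)  # answer logits


if __name__ == "__main__":
    model = MultimodalGraphReasoning()
    ocr = torch.randn(2, 30, 512)      # scene-text (OCR) node features
    qst = torch.randn(2, 20, 512)      # question token features
    img = torch.randn(2, 36, 512)      # image region features
    print(model(ocr, qst, img).shape)  # torch.Size([2, 5000])
```

The sketch reflects the abstract's claim that each graph representation improves as more semantic graphs participate in the shared attention step: adding another modality simply adds more nodes to the joint attention input.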