KTMN: Knowledge-driven Two-stage Modulation Network for visual question answering

Published: 01 Jan 2024 · Last Modified: 21 May 2025 · Multim. Syst. 2024 · CC BY-SA 4.0
Abstract: Existing visual question answering (VQA) methods introduce the Transformer as the backbone architecture for intra- and inter-modal interactions, demonstrating its effectiveness in dependency modeling and information alignment. However, the Transformer's inherent attention mechanism is easily affected by irrelevant information and does not exploit the positional information of objects in the image during modeling, which hampers its ability to focus adequately on key question words and crucial image regions during answer inference. Since this issue is particularly pronounced on the visual side, this paper designs a knowledge-driven two-stage modulation self-attention mechanism to optimize the internal interaction modeling of image sequences. In the first stage, we integrate textual context knowledge and the geometric knowledge of visual objects to modulate and optimize the query and key matrices, effectively guiding the model to focus on visual information relevant to the context and geometric knowledge during information selection. In the second stage, we design a comprehensive information representation that applies a secondary modulation to the interaction results of the first stage, further guiding the model to account for the overall context of the image during inference and enhancing its global understanding of the image content. On this basis, we propose a Knowledge-driven Two-stage Modulation Network (KTMN) for VQA, which enables fine-grained filtering of redundant image information while focusing more precisely on key regions. Finally, extensive experiments on the VQA v2 and CLEVR datasets yield overall accuracies of 71.36% and 99.20%, respectively, amply validating the effectiveness and rationality of the proposed method. Source code is available at https://github.com/shijingya/KTMN.
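The abstract describes modulating the query and key matrices with textual context and object geometry (stage one), then re-weighting the resulting attention output with a global image representation (stage two). The sketch below illustrates that idea only; the layer names, additive query/key fusion, and sigmoid gating are assumptions for illustration and are not taken from the paper or its released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageModulatedSelfAttention(nn.Module):
    """Minimal sketch of a knowledge-driven two-stage modulated self-attention
    over image-region features. Module names and fusion choices are hypothetical;
    only the high-level two-stage structure follows the abstract."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Stage 1: map textual context and geometric knowledge into modulation
        # signals for the query/key matrices (additive fusion assumed).
        self.text_mod = nn.Linear(dim, dim)
        self.geo_mod = nn.Linear(4, dim)     # e.g. normalized box coordinates
        # Stage 2: a global image summary gates the first-stage output
        # (sigmoid gating assumed).
        self.global_gate = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, regions, text_ctx, geometry):
        # regions:  (B, N, dim) image-region features
        # text_ctx: (B, dim)    pooled question/context representation
        # geometry: (B, N, 4)   per-region box coordinates
        q = self.q_proj(regions) + self.text_mod(text_ctx).unsqueeze(1)
        k = self.k_proj(regions) + self.geo_mod(geometry)
        v = self.v_proj(regions)

        # Stage 1: knowledge-modulated attention over image regions.
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v

        # Stage 2: secondary modulation with a comprehensive (global) representation.
        global_repr = regions.mean(dim=1, keepdim=True)       # (B, 1, dim)
        gate = torch.sigmoid(self.global_gate(global_repr))   # (B, 1, dim)
        return gate * out
```

For example, with `regions` of shape `(8, 36, 512)`, a pooled question vector of shape `(8, 512)`, and boxes of shape `(8, 36, 4)`, the layer returns region features of shape `(8, 36, 512)` whose attention has been biased by the question and geometry and then re-weighted by the global image summary.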