Abstract: Text-based Visual Question Answering (Text VQA) is a challenging task that requires a comprehensive understanding of scene texts in an image. Scene texts encompass information from both textual and visual modalities. Most existing methods treat information from the different modalities indiscriminately. However, such approaches may restrict the fine-grained interaction between the textual and visual modalities, leading to biased or incorrect semantic understanding. To address this limitation, we propose a two-stage reasoning network with modality decomposition for Text VQA. In the first stage, we handle the OCR textual and visual modalities separately through a modality-specific attention module that captures the crucial information of each modality. In the second stage, we aim to enhance the interaction between the textual and visual modalities. To achieve this, we introduce a semantic-guided interaction module that incorporates the semantic context to facilitate the alignment of the two modalities. Extensive experiments on the TextVQA and ST-VQA datasets demonstrate that our network achieves competitive performance compared with current state-of-the-art methods.
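The following is a minimal sketch of how the two stages described in the abstract might be realized in PyTorch. All module names, dimensions, the use of a pooled question embedding as the semantic context, and the shared cross-attention weights are assumptions for illustration only and are not taken from the paper.

```python
import torch
import torch.nn as nn


class ModalitySpecificAttention(nn.Module):
    """Stage 1 (sketch): self-attention over a single OCR modality
    (textual or visual) to capture its crucial information."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_ocr_tokens, dim) features of one modality
        attended, _ = self.attn(x, x, x)
        return self.norm(x + attended)


class SemanticGuidedInteraction(nn.Module):
    """Stage 2 (sketch): cross-modal interaction conditioned on a
    semantic context (here, a pooled question embedding)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_feat, vis_feat, sem_ctx):
        # Inject the semantic context into each modality before alignment.
        text_q = text_feat + sem_ctx   # broadcast: (B, N, D) + (B, 1, D)
        vis_q = vis_feat + sem_ctx
        # Textual tokens attend to visual tokens, and vice versa
        # (shared attention weights here purely to keep the sketch short).
        t2v, _ = self.cross_attn(text_q, vis_q, vis_q)
        v2t, _ = self.cross_attn(vis_q, text_q, text_q)
        return self.fuse(torch.cat([t2v, v2t], dim=-1))


# Toy forward pass with random features; shapes are illustrative only.
B, N, D = 2, 10, 256
text_feat = torch.randn(B, N, D)   # OCR textual-modality features
vis_feat = torch.randn(B, N, D)    # OCR visual-modality features
sem_ctx = torch.randn(B, 1, D)     # pooled question/semantic embedding

stage1_text = ModalitySpecificAttention(D)(text_feat)
stage1_vis = ModalitySpecificAttention(D)(vis_feat)
fused = SemanticGuidedInteraction(D)(stage1_text, stage1_vis, sem_ctx)
print(fused.shape)  # torch.Size([2, 10, 256])
```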