Cross-Modal Attention Networks with Modality Disentanglement for Scene-Text VQA

Published: 01 Jan 2022, Last Modified: 14 May 2023. Venue: ICME 2022.
Abstract: Understanding the text in visual scenes is essential for reasoning tasks in which text carries key information. A common approach is therefore to learn a multi-modal representation of the scene to support Scene-Text Visual Question Answering (VQA). Different modalities, such as text and image, are embedded into a joint semantic space, where attention mechanisms are widely applied. This paper treats Scene-Text VQA as a cross-modal task in which the joint embedding between a scene image and its text answer serves as a semantic bridge that shares strong semantic clues across the two modalities. To this end, our proposed framework first divides the scene recognition features into two types: visual features and textual features. External cross-modal pre-trained features are then introduced to simultaneously guide the representation of the visual and textual features through cross-modal attention networks. Finally, a Transformer is applied as a decoder to output answers. Experimental results on a public Scene-Text VQA dataset show that our proposed framework outperforms existing approaches in terms of accuracy.
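
The sketch below illustrates the general shape of such a pipeline: visual and textual (OCR) features are each refined by attending over shared external pre-trained features, and a Transformer decoder generates the answer. This is a minimal, hypothetical PyTorch illustration; the module names, feature dimensions, and fusion strategy are assumptions for exposition and are not taken from the paper.

```python
# Illustrative sketch only: generic cross-modal attention guiding two
# modality streams with shared pre-trained features, then a Transformer
# decoder for answer generation. All design choices here are assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One modality (query) attends over another modality (key/value)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query:   (B, Nq, dim), e.g. visual or OCR-token features
        # context: (B, Nc, dim), e.g. external cross-modal pre-trained features
        attended, _ = self.attn(query, context, context)
        return self.norm(query + attended)  # residual connection


class SceneTextVQASketch(nn.Module):
    """Toy pipeline: guide visual and textual features with shared
    pre-trained features, concatenate them as decoder memory, and
    decode an answer token sequence."""

    def __init__(self, dim: int = 512, vocab_size: int = 5000):
        super().__init__()
        self.vis_guide = CrossModalAttention(dim)
        self.txt_guide = CrossModalAttention(dim)
        decoder_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, vis_feats, txt_feats, pretrained_feats, answer_tokens):
        vis = self.vis_guide(vis_feats, pretrained_feats)   # guided visual stream
        txt = self.txt_guide(txt_feats, pretrained_feats)   # guided textual stream
        memory = torch.cat([vis, txt], dim=1)                # joint multi-modal memory
        tgt = self.embed(answer_tokens)                      # (B, T, dim)
        dec = self.decoder(tgt, memory)
        return self.out(dec)                                 # (B, T, vocab_size)


if __name__ == "__main__":
    model = SceneTextVQASketch()
    B = 2
    logits = model(
        torch.randn(B, 36, 512),            # region/visual features
        torch.randn(B, 20, 512),            # OCR/text-token features
        torch.randn(B, 30, 512),            # external pre-trained features
        torch.randint(0, 5000, (B, 12)),    # shifted answer tokens
    )
    print(logits.shape)  # torch.Size([2, 12, 5000])
```

Note that this sketch only conveys the cross-modal attention and decoding idea; the paper's actual modality-disentanglement design, feature extractors, and training objective are not reproduced here.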