Keywords: Text-VQA, Contrastive Learning
TL;DR: We propose a novel two-stage model with cross-level contrastive learning for the Text-based Visual Question Answering task that explicitly aligns object-level and scene text-level representations across the visual and linguistic modalities.
Abstract: The Text-based Visual Question Answering (Text-VQA) task requires a model to learn effective representations in a joint semantic space. Previous methods lack explicit alignment between object-level and scene text-level representations across the visual and linguistic modalities. To address this issue, we propose a novel two-stage model with cross-level contrastive learning. In the first, pre-training stage, we encourage the model to pull cross-level, cross-modal representations of the same image closer together in the semantic space, while pushing apart representations from different images. We then fine-tune the model to generate the answer to the question. Experimental results on a widely used benchmark dataset demonstrate the effectiveness of our proposed model compared to existing methods.
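For concreteness, the pre-training objective described above can be realized as a symmetric InfoNCE-style contrastive loss. The sketch below is a minimal illustration, not the paper's actual implementation: it assumes one pooled object-level and one pooled scene-text-level embedding per image, and the function name, pooling, and temperature value are all hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_level_contrastive_loss(obj_emb, txt_emb, temperature=0.07):
    """InfoNCE-style sketch of cross-level contrastive pre-training.

    obj_emb: (batch, dim) pooled object-level embeddings, one per image.
    txt_emb: (batch, dim) pooled scene text-level embeddings, one per image.
    Pairs from the same image (the diagonal) are pulled together;
    pairs from different images (off-diagonal) are pushed apart.
    """
    # L2-normalize so the dot product is cosine similarity.
    obj = F.normalize(obj_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    # (batch, batch) similarity matrix; entry (i, j) compares
    # image i's object embedding with image j's scene-text embedding.
    logits = obj @ txt.t() / temperature
    targets = torch.arange(obj.size(0), device=obj.device)

    # Symmetric cross-entropy: match objects to scene text and vice versa.
    loss_o2t = F.cross_entropy(logits, targets)
    loss_t2o = F.cross_entropy(logits.t(), targets)
    return (loss_o2t + loss_t2o) / 2
```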
Submission Number: 59