Filling in the Blank: Rationale-Augmented Prompt Tuning for TextVQA

Published: 01 Jan 2023, Last Modified: 06 Nov 2023. ACM Multimedia 2023.
Abstract: Recently, generative text-based visual question answering (TextVQA) methods, often built on language models, have exhibited impressive results and drawn increasing attention. However, due to inconsistencies in both input forms and optimization objectives, the power of pretrained language models is not fully exploited, resulting in the need for large amounts of training data. In this work, we rethink the characteristics of the TextVQA task and observe that scene text is itself a special kind of language embedded in images. To this end, we propose a text-centered generative framework, FITB (short for Filling In The Blank), in which multimodal information is represented mainly in textual form and rationale-augmented prompting is incorporated. Specifically, an infilling-based prompt strategy reformulates TextVQA as the novel problem of filling in a blank with the proper scene text according to the language context. Furthermore, to prevent the model from overfitting to language bias, we design a rough answer grounding module that provides visual rationales to promote multimodal reasoning. Extensive experiments verify the superiority of FITB in both fully supervised and zero-shot/few-shot settings. Notably, even while using about 64M less data, FITB surpasses the state-of-the-art method by 3.00% and 1.99% on the TextVQA and ST-VQA datasets, respectively.
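To make the infilling-based prompt strategy concrete, here is a minimal sketch (not the authors' code; all names and the prompt template are hypothetical). It casts a TextVQA question as a fill-in-the-blank statement and selects the OCR token whose completed sentence a scoring function (standing in for a pretrained language model's likelihood) rates highest:

```python
# Hypothetical sketch of infilling-based prompting for TextVQA.
# A pretrained LM's sequence likelihood is abstracted as `score_fn`.

def build_infilling_prompt(question: str, ocr_tokens: list[str]) -> str:
    """Turn a question plus detected scene text into a cloze-style prompt."""
    candidates = ", ".join(ocr_tokens)
    return (f"question: {question} "
            f"scene text: {candidates} "
            f"answer: [BLANK]")

def fill_blank(prompt: str, ocr_tokens: list[str], score_fn) -> str:
    """Fill [BLANK] with the OCR candidate whose sentence scores highest.

    `score_fn` is any callable prompt -> float; in FITB-like systems it
    would be the pretrained language model's score of the infilled text.
    """
    best_token, best_score = None, float("-inf")
    for tok in ocr_tokens:
        filled = prompt.replace("[BLANK]", tok)
        s = score_fn(filled)
        if s > best_score:
            best_token, best_score = tok, s
    return best_token

# Toy usage with a dummy scorer that happens to prefer "nikon".
tokens = ["nikon", "35mm", "japan"]
prompt = build_infilling_prompt("what is the brand of this camera?", tokens)
answer = fill_blank(prompt, tokens,
                    lambda p: 1.0 if "answer: nikon" in p else 0.0)
print(answer)  # -> nikon
```

The key design point is that the answer space is restricted to scene-text candidates, so the language model chooses among visually grounded options rather than generating freely.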