Show Exemplars and Tell Me What You See: In-Context Learning with Frozen Large Language Models for TextVQA

Published: 01 Jan 2024, Last Modified: 08 Oct 2025 · PRCV (7) 2024 · CC BY-SA 4.0
Abstract: Modern Large Visual Language Models (LVLMs) transfer the powerful abilities of Large Language Models (LLMs) to visual domains by combining LLMs with a pre-trained visual encoder, and can also leverage in-context learning, inherited from LLMs, to achieve remarkable performance on the Text-based Visual Question Answering (TextVQA) task. However, the alignment process between vision and language requires a significant amount of training resources. This study introduces SETS (short for Show Exemplars and Tell me what you See), a straightforward yet effective in-context learning framework for TextVQA. SETS consists of two components: an LLM for reasoning and decision-making, and a set of external tools that extract visual entities from scene images, including scene text and objects, to assist the LLM. More specifically, SETS selects visual entities relevant to the question, constructs their spatial relationships, and customizes task-specific instructions. Given these instructions, a two-round inference strategy is then applied to automatically choose the final predicted answer. Extensive experiments on three widely used TextVQA datasets demonstrate that SETS enables frozen LLMs such as Vicuna and LLaMA2 to achieve superior performance compared with LVLM counterparts.
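The abstract outlines a tool-augmented prompting pipeline: external tools extract scene text and objects, the framework serializes question-relevant entities and their layout into a prompt with exemplars, and a frozen LLM answers in two rounds. The following Python sketch illustrates that general flow under stated assumptions; the helper names (`tools.ocr`, `tools.detect_objects`, `call_llm`) and the exact prompt format are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch of a SETS-style in-context pipeline for TextVQA.
# All helper interfaces below are assumptions for the sake of the example.

def build_prompt(question, exemplars, scene_texts, objects):
    """Serialize in-context exemplars and question-relevant entities into a prompt."""
    lines = ["Answer the question using the scene text and objects listed below."]
    for ex in exemplars:  # each exemplar: {'entities': str, 'question': str, 'answer': str}
        lines.append(f"Entities: {ex['entities']}\nQ: {ex['question']}\nA: {ex['answer']}")
    # Describe entities with coarse spatial positions (e.g. "top-left", "center").
    entity_desc = "; ".join(
        [f"text '{t['text']}' at {t['region']}" for t in scene_texts]
        + [f"object '{o['label']}' at {o['region']}" for o in objects]
    )
    lines.append(f"Entities: {entity_desc}\nQ: {question}\nA:")
    return "\n\n".join(lines)


def answer_textvqa(image, question, exemplars, tools, call_llm):
    """Two-round inference with a frozen LLM and external vision tools (sketch)."""
    # Round 1: extract visual entities with external tools and draft candidate answers.
    scene_texts = tools.ocr(image)          # e.g. [{'text': 'EXIT', 'region': 'top-left'}, ...]
    objects = tools.detect_objects(image)   # e.g. [{'label': 'sign', 'region': 'top-left'}, ...]
    prompt = build_prompt(question, exemplars, scene_texts, objects)
    candidates = [call_llm(prompt) for _ in range(2)]

    # Round 2: ask the same frozen LLM to choose the candidate best supported by the entities.
    verify_prompt = (
        f"{prompt}\nCandidate answers: {candidates}\n"
        "Select the single answer best supported by the listed entities."
    )
    return call_llm(verify_prompt)
```

In this sketch the LLM stays frozen throughout: all task adaptation happens through the exemplars and the entity descriptions in the prompt, which mirrors the training-free setting the abstract contrasts with resource-intensive vision-language alignment.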