From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

Jiaxian Guo; Junnan Li; Dongxu Li; Anthony Tiong; Boyang Li; Dacheng Tao; Steven HOI

From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Tiong, Boyang Li, Dacheng Tao, Steven HOI

22 Sept 2022 (modified: 04 Aug 2025)ICLR 2023 Conference Withdrawn SubmissionReaders: Everyone

Keywords: Large Language Model, Visual Question Answer, Prompts, Zero-Shot

Abstract: Large language models (LLMs) have demonstrated excellent zero-shot generalization to new tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLM and VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides the prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to provide prompts that can describe image content and self-constructed question-answer pairs, which can effectively guide LLM to perform VQA tasks. Img2Prompt offers the following benefits: 1) It is LLM-agnostic and can work with any LLM to perform VQA. 2) It renders end-to-end training unnecessary and significantly reduces the cost of deploying LLM for VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. On the challenging A-OKVQA dataset, our method outperforms some few-shot methods by as much as 20\%.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/from-images-to-textual-prompts-zero-shot-vqa/code)

13 Replies

Loading