Prompt-aware Adapter: Towards Learning Effective Visual Tokens for GPT4-Style Multimodal Models

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Large Language Models, Multimodal Large Language Models, adapter, vision-language
Abstract: The rapid advancement of Large Language Models (LLMs) has revolutionized chatbot systems, resulting in unprecedented levels of intelligence. Moreover, recent GPT4-style models have demonstrated extraordinary multi-modal abilities, such as generating human-like responses based on visual inputs and textual prompts. To bridge the gap between the vision and language modalities, GPT4-style models usually learn an adapter that converts the visual inputs into tokens the LLM can understand. However, these adapters are usually independent of the textual prompt and thus output invariant visual tokens, regardless of the question of interest. Such prompt-irrelevant visual tokens significantly increase the burden of visual reasoning on the LLM. In this paper, we propose a prompt-aware adapter, which dynamically embeds visual inputs conditioned on the prompt. In this way, the proposed adapter extracts the visual clues most relevant to the prompt, thereby greatly facilitating visual understanding by the LLM. Experiments on various question types, including object classification, color recognition, counting, and position reasoning, demonstrate the effectiveness of the proposed method. Code will be publicly available.
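The abstract describes conditioning visual tokens on the textual prompt but does not specify the mechanism. Below is a minimal NumPy sketch of one plausible realization, using cross-attention in which prompt-token embeddings act as queries over visual patch features; the function name `prompt_aware_adapter` and all weights and dimensions here are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_aware_adapter(visual_feats, prompt_emb, W_q, W_k, W_v):
    """Hypothetical sketch: cross-attention where prompt embeddings
    query visual features, so the output visual tokens change with
    the prompt instead of being invariant to it."""
    Q = prompt_emb @ W_q           # (n_prompt, d) prompt-derived queries
    K = visual_feats @ W_k         # (n_vis, d)    visual keys
    V = visual_feats @ W_v         # (n_vis, d)    visual values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n_prompt, n_vis)
    attn = softmax(scores, axis=-1)
    return attn @ V                # prompt-conditioned visual tokens

# Toy example with random features (shapes are illustrative)
rng = np.random.default_rng(0)
d = 8
visual = rng.standard_normal((16, d))   # 16 visual patch features
prompt = rng.standard_normal((4, d))    # 4 prompt token embeddings
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
tokens = prompt_aware_adapter(visual, prompt, W_q, W_k, W_v)
print(tokens.shape)  # (4, 8)
```

A different prompt would produce different attention weights and hence different visual tokens for the same image, which is the property the abstract contrasts with prompt-irrelevant adapters.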
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7780