Understanding and Improving In-Context Learning on Vision-language Models

ICLR 2024 Workshop ME-FoMo Submission 9 Authors

Published: 04 Mar 2024, Last Modified: 29 Apr 2024
Venue: ME-FoMo 2024 Poster
License: CC BY 4.0
Keywords: multimodal LLMs, in-context learning, multimodal models, vision-language models
TL;DR: This paper investigates the in-context learning ability of multimodal LLMs and proposes a simple yet effective method to choose demonstrations.
Abstract: In-context learning (ICL) on large language models (LLMs) has received great attention, and this technique can also be applied to vision-language models (VLMs) built upon LLMs. These VLMs can respond to queries by conditioning their responses on a series of multimodal demonstrations, which comprise images, queries, and answers. Although ICL has been extensively studied on LLMs, research on ICL for VLMs remains limited. The additional visual information in the demonstrations motivates the following research questions: Which modality in the demonstrations is more significant? How can we select effective multimodal demonstrations to enhance ICL performance? This study investigates the significance of both visual and language information. Our findings indicate that ICL in VLMs is predominantly driven by the textual information in the demonstrations, whereas the visual information in the demonstrations barely affects ICL performance. Motivated by this analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demonstrations and yields better ICL performance. Extensive experiments support our findings and the improvement in ICL performance of VLMs.
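To make the idea of mixed-modality demonstration selection concrete, the sketch below shows one plausible way such a strategy could be implemented: candidate demonstrations are first filtered by visual similarity to the query image and then re-ranked by textual similarity to the query text. The feature extractors, the pool sizes `k` and `n`, and the filter-then-rerank ordering are illustrative assumptions for this sketch, not a verbatim specification of MMICES from the paper.

```python
import numpy as np

def cosine_sim(query_vec, cand_mat):
    """Cosine similarity between one query vector and a matrix of candidate vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_mat / np.linalg.norm(cand_mat, axis=1, keepdims=True)
    return c @ q

def select_demonstrations(query_img_feat, query_txt_feat,
                          cand_img_feats, cand_txt_feats,
                          k=32, n=4):
    """Mixed-modality selection sketch: pre-filter the candidate pool by visual
    similarity, then re-rank the filtered pool by textual similarity and keep
    the top-n demonstrations for the prompt."""
    visual_scores = cosine_sim(query_img_feat, cand_img_feats)
    top_k = np.argsort(-visual_scores)[:k]            # visually similar pool
    text_scores = cosine_sim(query_txt_feat, cand_txt_feats[top_k])
    top_n = top_k[np.argsort(-text_scores)[:n]]       # textually closest demos
    return top_n                                      # indices into the candidate set

# Toy usage with random features standing in for image/text embeddings.
rng = np.random.default_rng(0)
demo_idx = select_demonstrations(
    query_img_feat=rng.normal(size=512),
    query_txt_feat=rng.normal(size=512),
    cand_img_feats=rng.normal(size=(1000, 512)),
    cand_txt_feats=rng.normal(size=(1000, 512)),
)
print(demo_idx)
```

In this sketch, the visual filter keeps the candidate pool relevant to the query image while the textual re-ranking supplies the signal that, per the paper's findings, dominates ICL performance in VLMs.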
Submission Number: 9