Do Vision-Language Models Learn In Context? Not So Fast

ACL ARR 2025 February Submission 4241 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: In-context learning enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. While this ability has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. Existing research primarily focuses on a few models trained on interleaved image-text datasets and often overlooks image captioning in its analysis. In this work, we systematically analyze in-context learning in VLMs, evaluating six models across four architectures on three image-captioning and four visual question-answering benchmarks. We investigate the influence of prompt design, demonstration selection, model architecture, and training strategy. We also extend our analysis beyond models trained on interleaved datasets to include those trained on image-text pairs, which are often considered incapable of in-context learning. Our findings show that VLMs still struggle to leverage contextual information to adapt their outputs. However, detailed prompts that specify the task and the structure of the demonstrations improve performance more than simply concatenating examples. Additionally, while instruction tuning enhances comprehension of detailed instructions, it reduces reliance on contextual examples and may hinder models' in-context learning capacity. Moreover, VLMs with advanced modality projectors can achieve competitive in-context learning performance even when trained on image-text pairs.
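To make the prompt-design contrast in the abstract concrete, the sketch below shows one way the two prompting styles could be assembled for an image-captioning query: plain concatenation of demonstrations versus a detailed prompt that states the task and the structure of the demonstrations. This is an illustrative assumption, not the submission's actual templates; the function names (build_concat_prompt, build_detailed_prompt) and the <image> placeholder token are hypothetical.

from dataclasses import dataclass
from typing import List

@dataclass
class Demo:
    image_token: str  # placeholder where the demonstration image is injected, e.g. "<image>"
    caption: str      # reference caption for the demonstration image

def build_concat_prompt(demos: List[Demo], query_token: str = "<image>") -> str:
    """Baseline: simply concatenate demonstration pairs before the query image."""
    parts = [f"{d.image_token} {d.caption}" for d in demos]
    parts.append(query_token)  # the model must infer on its own that it should produce a caption
    return "\n".join(parts)

def build_detailed_prompt(demos: List[Demo], query_token: str = "<image>") -> str:
    """Detailed prompt: states the task and the structure of the demonstrations explicitly."""
    header = (
        "You are given example image-caption pairs. "
        "Each example shows an image followed by its caption. "
        "Write a caption for the final image in the same style."
    )
    parts = [header]
    for i, d in enumerate(demos, 1):
        parts.append(f"Example {i}:\nImage: {d.image_token}\nCaption: {d.caption}")
    parts.append(f"Now the final image:\nImage: {query_token}\nCaption:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    demos = [
        Demo("<image>", "A dog running across a grassy field."),
        Demo("<image>", "Two cyclists riding along a coastal road at sunset."),
    ]
    print(build_concat_prompt(demos))
    print("---")
    print(build_detailed_prompt(demos))

Under this sketch, both builders produce the same demonstration content; they differ only in whether the task and example structure are spelled out, which is the variable the abstract reports as mattering more than the number of concatenated examples.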
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: In-Context Learning, Vision-Language Models, Multimodality, Evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4241