Effects of Image Samples on In-Context Learning of Multimodal Large Language Models

Published: 2025, Last Modified: 15 Jan 2026 · iiWAS 2025 · CC BY-SA 4.0
Abstract: Recently, multimodal large language models (LLMs) have gained attention for their ability to handle various data types such as text, images, and audio. These models are useful for diverse tasks, especially through in-context learning, where task-specific performance can be improved by including a few examples in the prompt. However, effective methods for selecting few-shot samples for multimodal LLMs remain unclear. This study explores the impact of image samples on one-shot in-context learning using a violent-image classification task. We investigate which kinds of image examples, paired with their labels, help improve classification performance. Experimental results comparing zero-shot and one-shot settings demonstrate that the choice of image sample can significantly affect the model's performance. Furthermore, we analyze the characteristics of image samples that lead to better or worse classification results. Our findings clarify the role of image examples in enhancing multimodal LLM performance in one-shot in-context learning scenarios.
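To make the zero-shot vs. one-shot comparison concrete, the sketch below builds the two prompt variants for the classification task described above. This is an illustrative assumption, not the paper's code: the message schema mimics common multimodal chat APIs, and the task wording, URLs, and labels are hypothetical placeholders.

```python
# Hypothetical sketch: zero-shot vs. one-shot prompts for violent-image
# classification, in a message format resembling common multimodal chat APIs.

TASK = "Classify the image as 'violent' or 'non-violent'. Answer with one word."

def zero_shot_prompt(query_image_url):
    """Zero-shot: only the task instruction and the query image."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": TASK},
            {"type": "image_url", "image_url": {"url": query_image_url}},
        ],
    }]

def one_shot_prompt(example_image_url, example_label, query_image_url):
    """One-shot: prepend a single labeled image example before the query."""
    return [
        {"role": "user", "content": [
            {"type": "text", "text": TASK},
            {"type": "image_url", "image_url": {"url": example_image_url}},
        ]},
        # The example's label is given as a prior assistant turn.
        {"role": "assistant", "content": [{"type": "text", "text": example_label}]},
        {"role": "user", "content": [
            {"type": "text", "text": TASK},
            {"type": "image_url", "image_url": {"url": query_image_url}},
        ]},
    ]

# Placeholder URLs and label, for illustration only.
messages = one_shot_prompt("https://example.com/demo.jpg", "violent",
                           "https://example.com/query.jpg")
print(len(messages))  # example turn pair + query turn
```

The study's comparison then amounts to sending `zero_shot_prompt(...)` and `one_shot_prompt(...)` for the same query image and measuring how the labeled example shifts classification accuracy.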