What Makes Good In-context Demonstrations in Multimodal Large Language Model?

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
Abstract: Recently, multimodal large language models (MLLMs) have begun to exhibit the capability of in-context learning (ICL), enabling them to learn a new task by conditioning solely on a few in-context examples, without updating model parameters. However, existing studies on MLLMs often randomly sample a subset of in-context examples and then order those examples randomly. It remains unclear what makes good in-context demonstrations for MLLMs. In this paper, we fill this gap by empirically exploring the impact of two key factors on ICL performance in MLLMs: the selection and the ordering of demonstration examples. We conduct extensive experiments on three multimodal tasks: visual question answering (VQA), image captioning, and multimodal image-text classification. Our experimental results show that both factors dramatically impact ICL performance. Finally, we summarize our findings and provide takeaway suggestions on how to construct effective demonstrations for MLLMs.
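
To make the two studied factors concrete, the minimal sketch below illustrates one common way demonstrations could be selected and ordered for ICL: retrieve the candidates most similar to the query embedding and place the most similar one closest to the query. This is only an illustrative assumption, not the paper's method; all names, data, and the similarity-based strategy here are hypothetical, and embeddings would normally come from an image/text encoder.

```python
# Hedged sketch: similarity-based selection and ordering of in-context
# demonstrations. Hypothetical data and strategy, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate pool: each entry has an embedding and a (question, answer) pair.
pool = [
    {"emb": rng.normal(size=512), "question": f"Q{i}", "answer": f"A{i}"}
    for i in range(100)
]
# Embedding of the test query (e.g., of the image plus question), also synthetic here.
query_emb = rng.normal(size=512)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Selection: keep the k candidates most similar to the query.
k = 4
scored = sorted(pool, key=lambda ex: cosine(ex["emb"], query_emb), reverse=True)
selected = scored[:k]

# Ordering: least similar first, most similar last (adjacent to the query).
ordered = list(reversed(selected))

# Assemble the in-context prompt followed by the test query.
prompt = "".join(f"Question: {ex['question']} Answer: {ex['answer']}\n" for ex in ordered)
prompt += "Question: <test question> Answer:"
print(prompt)
```

Both choices in this sketch, which candidates enter the prompt and in what sequence, correspond directly to the selection and ordering factors the paper studies; alternatives such as random sampling or random ordering would swap out only those two steps.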
Paper Type: short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability
Languages Studied: English