What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

Libo Qin; Qiguang Chen; Hao Fei; Zhi Chen; Min Li; Wanxiang Che

Published: 25 Sept 2024, Last Modified: 26 Dec 2024

Venue: NeurIPS 2024 poster

License: CC BY 4.0

Keywords: Multi-modal In-Context Learning, Demonstration Retrieval, Demonstration Ordering, In-Context Learning

Abstract: Recently, rapid advancements in Multi-Modal In-Context Learning (MM-ICL) have achieved notable success, enabling superior performance across various tasks without requiring additional parameter tuning. However, the underlying rules for the effectiveness of MM-ICL remain under-explored. To fill this gap, this work investigates the research question: "_What factors affect the performance of MM-ICL?_" To this end, we conduct extensive experiments on the three core steps of MM-ICL, namely demonstration retrieval, demonstration ordering, and prompt construction, using 6 vision large language models and 20 strategies. Our findings highlight (1) the necessity of a multi-modal retriever for demonstration retrieval, (2) the importance of intra-demonstration ordering over inter-demonstration ordering, and (3) the enhancement of task comprehension through introductory instructions in prompts. We hope this study can serve as a foundational guide for optimizing MM-ICL strategies in future research.

Primary Area: Natural language processing

Submission Number: 16952
