Abstract: Recently, multimodal in-context learning (ICL) has made significant progress, showing impressive performance across various tasks. Existing works demonstrate that demonstration selection has a significant influence on the effectiveness of multimodal ICL. However, these methods extract visual and textual features from multimodal examples independently and use them for demonstration retrieval; the influence of multimodal embedding methods on ICL demonstration selection is not fully understood. In addition, current multimodal ICL demonstration retrieval methods are mainly unsupervised, hindering adaptation to the specific characteristics of different tasks. To address these challenges, we first compare modality-independent and modality-integrated encoders for representing multimodal examples. We then introduce MeCO, a supervised training pipeline for multimodal ICL demonstration retrievers in which multiple encoders cooperate to mitigate their inherent biases and enhance adaptation to specific tasks. Experiments across a wide range of multimodal tasks and MLLMs demonstrate that modality-integrated retrievers outperform modality-independent retrievers, and that our supervised training pipeline significantly improves the performance of multimodal ICL demonstration retrievers, benefiting MLLMs on various tasks.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal information extraction
Languages Studied: English
Submission Number: 1109