Abstract: Recently, multimodal in-context learning (ICL) has made significant progress, showing impressive performance across various tasks. Existing works demonstrate that demonstration selection has a significant influence on the effectiveness of multimodal ICL. However, these methods extract visual and textual features from multimodal examples independently and use them for demonstration retrieval; the influence of multimodal embedding methods on ICL demonstration selection is not fully understood. In addition, current multimodal ICL demonstration retrieval methods are mainly unsupervised, hindering adaptation to the specific characteristics of different tasks. To address these challenges, we first compare modality-independent and modality-integrated encoders for representing multimodal examples. We then introduce MeCO, a supervised training pipeline for multimodal ICL demonstration retrievers in which multiple encoders cooperate to mitigate their inherent biases and enhance adaptation to specific tasks. Experiments across a wide range of multimodal tasks and MLLMs demonstrate that modality-integrated retrievers outperform modality-independent retrievers, and that our supervised training pipeline significantly improves the performance of multimodal ICL demonstration retrievers, benefiting MLLMs on various tasks.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal information extraction
Languages Studied: English
Submission Number: 1109