DemoReranker: Enhancing the In-context Learning Capability of Multi-modal Large Models via Demonstration Reranking
Keywords: large vision-language models, in-context learning, visual question answering
Abstract: When deploying Large Multi-modal Models (LMMs), researchers and practitioners often rely on simplistic strategies for in-context learning (ICL), such as reusing a fixed set of demonstrations across diverse samples or retrieving candidates directly with the CLIP model. These heuristics do not guarantee that the selected demonstrations are the ones most useful to the LMM. To bridge this gap, we introduce DemoReranker, a novel framework that fine-tunes a specialized reranker module to improve demonstration selection for LMMs. First, we assess the quality of each candidate demonstration by measuring its influence on the model's output. Second, our reranker adds a scoring head on top of the CLIP embedding model to evaluate the compatibility between a test sample and each candidate demonstration. Third, we optimize the reranker with a list-wise ranking loss while keeping the CLIP backbone frozen. Extensive experiments on 7 datasets spanning 3 multi-modal tasks confirm that DemoReranker enhances LMM performance in ICL by reranking demonstrations to surface the most suitable candidates.
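A minimal PyTorch sketch of the architecture and objective the abstract describes, assuming a small MLP scoring head over precomputed (frozen) CLIP embeddings and a ListNet-style list-wise loss; the names (DemoScorer, listwise_ranking_loss), the head architecture, the embedding dimension, and the exact loss form are illustrative assumptions, not the authors' implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DemoScorer(nn.Module):
        """Hypothetical scoring head over frozen CLIP embeddings: scores the
        compatibility of a test sample with each candidate demonstration."""

        def __init__(self, embed_dim: int = 512, hidden_dim: int = 256):
            super().__init__()
            # MLP over the concatenated (query, candidate) embeddings.
            self.head = nn.Sequential(
                nn.Linear(2 * embed_dim, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, query_emb: torch.Tensor,
                    cand_embs: torch.Tensor) -> torch.Tensor:
            # query_emb: (D,) CLIP embedding of the test sample.
            # cand_embs: (K, D) CLIP embeddings of K candidate demonstrations.
            q = query_emb.unsqueeze(0).expand_as(cand_embs)        # (K, D)
            scores = self.head(torch.cat([q, cand_embs], dim=-1))  # (K, 1)
            return scores.squeeze(-1)                              # (K,)

    def listwise_ranking_loss(pred_scores: torch.Tensor,
                              target_scores: torch.Tensor) -> torch.Tensor:
        """ListNet-style top-1 loss: cross-entropy between the softmax
        distributions induced by predicted and target (influence) scores."""
        target_dist = F.softmax(target_scores, dim=-1)
        log_pred = F.log_softmax(pred_scores, dim=-1)
        return -(target_dist * log_pred).sum()

    # Usage: rank K=8 candidates for one test sample (dummy tensors).
    scorer = DemoScorer()
    query = torch.randn(512)          # precomputed CLIP embedding (frozen)
    candidates = torch.randn(8, 512)  # precomputed candidate embeddings
    influence = torch.randn(8)        # influence-based quality targets
    loss = listwise_ranking_loss(scorer(query, candidates), influence)
    loss.backward()  # gradients reach only the scoring head

Because the CLIP embeddings are precomputed and detached, gradients flow only into the scoring head, which matches the frozen-backbone training the abstract describes.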
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13083