Keywords: Large Multimodal Models, In-context Learning, Active Learning
TL;DR: Active In-Context Learning dynamically selects and annotates a small, evolving active set during testing, improving LMM performance without relying on large pre-annotated training sets.
Abstract: The performance of Large Multimodal Models (LMMs) on downstream tasks improves substantially when examples of visual-text relationships are incorporated as context, with performance gains increasing as the number of examples and the context window size grow. However, collecting high-quality training sets for In-Context Learning (ICL) to retrieve multimodal examples is not trivial, particularly in specialized domains such as healthcare, remote sensing, finance, and scientific research, due to the significant costs of manual labeling and strict privacy regulations. In this paper, we introduce Active In-Context Learning (AICL), a novel paradigm that eliminates the need for traditional training sets in multimodal ICL. AICL dynamically selects and annotates a small, highly informative set of samples in real time during the query phase of LMMs. This active set evolves throughout querying, with the most relevant examples being continuously retrieved from it to optimize LMM performance on new data, without relying on pre-existing training sets. To construct an optimal active set, we propose Spectral-based Representative Sampling, which applies spectral clustering in the early query phase to select samples that are early, class-balanced, and representative, ensuring the active set captures key features of the data distribution and reduces data bias. To fully leverage the active set, we propose Similarity-enhanced TopK Prompt Construction, which retrieves the most relevant multimodal examples using a TopK similarity strategy and integrates the visual similarities between the multimodal examples and the query samples directly into the text prompts. By incorporating this similarity information, LMMs can better grasp these relationships, leading to more accurate and context-aware predictions. Experimental results on ten specialized datasets and four LMMs show that our method significantly enhances LMMs' generalization performance.
For example, in medical diagnosis tasks, our method, using only 10 annotated samples in the active set, outperforms existing ICL methods that rely on 2,000 annotated training samples.
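The two components described in the abstract can be sketched roughly as follows. This is an illustrative approximation only, not the paper's implementation: the affinity kernel, the lightweight k-means step, the cosine-similarity retrieval, and the prompt template are all assumptions, and the feature vectors stand in for whatever visual encoder the method actually uses.

```python
import numpy as np

def spectral_representatives(features, k, gamma=1.0, seed=0):
    """Spectral-based Representative Sampling (sketch): RBF affinity ->
    normalized graph Laplacian -> k smallest eigenvectors -> lightweight
    k-means -> the sample nearest each centroid joins the active set."""
    n = len(features)
    # RBF affinity matrix over pairwise squared distances
    sq = np.sum(features ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * features @ features.T, 0.0)
    A = np.exp(-gamma * d2)
    # symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # spectral embedding: eigenvectors of the k smallest eigenvalues
    _, v = np.linalg.eigh(L)
    emb = v[:, :k]
    # lightweight k-means in the spectral embedding
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(n, size=k, replace=False)]
    for _ in range(20):
        labels = np.argmin(((emb[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = emb[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # one representative index per non-empty cluster: nearest to its centroid
    reps = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx):
            reps.append(int(idx[np.argmin(((emb[idx] - centers[j]) ** 2).sum(-1))]))
    return sorted(set(reps))

def similarity_enhanced_prompt(query_feat, active_feats, active_annotations, k=2):
    """Similarity-enhanced TopK Prompt Construction (sketch): retrieve the
    k most similar active-set examples by cosine similarity and write the
    similarity scores directly into the text prompt."""
    q = query_feat / np.linalg.norm(query_feat)
    A = active_feats / np.linalg.norm(active_feats, axis=1, keepdims=True)
    sims = A @ q
    top = np.argsort(-sims)[:k]
    lines = [f"Example (visual similarity {sims[i]:.2f}): {active_annotations[i]}"
             for i in top]
    return "\n".join(lines) + "\nQuery: <image> What is the label?"
```

In this sketch, the representatives returned in the early query phase would be sent for annotation to form the active set; subsequent queries then call the prompt constructor against that set, so the context examples and their similarity scores evolve as querying proceeds.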
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2978