Agreement with the Ensemble for Zero-Shot Vision-Language Model Selection

14 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision-Language Models, model selection
Abstract: Pretrained Vision-Language Models (VLMs) such as CLIP are well known for enabling zero-shot classification from category names alone. The rapid growth of open-access variants has produced a diverse VLM zoo, where selecting the most suitable model can yield superior zero-shot performance, yet the optimal choice is often dataset-dependent. Selecting VLMs for zero-shot tasks is challenging, however, since only category names are available and target images are absent. Prior approaches rely on text-only evaluation, which suffers from the modality gap inherent to VLMs. To address this issue, we propose SAGE (Selection via AGreement-with-the-Ensemble), which leverages unlabeled in-the-wild images to bridge the modality gap. Specifically, SAGE scores each candidate VLM by how closely its predictions on in-the-wild images agree with those of the ensemble of all candidates, and selects the model with the highest agreement. Experiments demonstrate that SAGE consistently outperforms state-of-the-art zero-shot VLM selection methods.
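The abstract does not spell out SAGE's exact agreement measure or ensembling scheme, so the following is a minimal sketch of the general idea under stated assumptions, not the authors' implementation. The function name `select_vlm_by_ensemble_agreement`, the top-1 prediction-match agreement score, and the softmax-averaged ensemble are all illustrative assumptions; the per-model logits stand in for the image-text similarities each candidate VLM produces over the category names on unlabeled in-the-wild images.

```python
import numpy as np


def select_vlm_by_ensemble_agreement(logits_per_model):
    """Return the index of the candidate VLM whose zero-shot predictions
    on unlabeled in-the-wild images agree most with the ensemble's,
    plus the per-model agreement scores."""
    # Convert each model's image-text similarity logits (n_images x
    # n_classes) into a predictive distribution so models are comparable.
    probs = []
    for logits in logits_per_model:
        z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
        e = np.exp(z)
        probs.append(e / e.sum(axis=1, keepdims=True))

    # Ensemble counterpart: average predictive distribution of all candidates.
    ensemble_pred = np.mean(probs, axis=0).argmax(axis=1)

    # Agreement score (an assumption, not necessarily SAGE's metric):
    # fraction of images where a model's top-1 prediction matches the ensemble's.
    scores = [float((p.argmax(axis=1) == ensemble_pred).mean()) for p in probs]
    return int(np.argmax(scores)), scores


# Toy usage: synthetic logits standing in for 5 candidate VLMs scored
# over 500 in-the-wild images and 10 category names.
rng = np.random.default_rng(0)
fake_logits = [rng.normal(size=(500, 10)) for _ in range(5)]
best, scores = select_vlm_by_ensemble_agreement(fake_logits)
print(best, [round(s, 3) for s in scores])
```

Averaging softmax distributions is just one natural way to form the ensemble; the paper's actual agreement measure and ensembling choice may differ.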
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5212