LOVM: Language-Only Vision Model Selection

Published: 01 Nov 2023, Last Modified: 12 Dec 2023, R0-FoMo Poster
Keywords: Multi-modal models, Language-Vision Models, Foundation Models, Transferability, Model Selection
TL;DR: We introduce a novel task where we select an optimal VLM given only a text description of the application.
Abstract: Pre-trained multi-modal vision-language models (VLMs) excel in downstream applications, especially in the few- and zero-shot settings. However, choosing the optimal VLM for a given downstream application is challenging because performance depends on both the task and the dataset. Exhaustively evaluating all VLMs is impractical, as it requires collecting a labeled evaluation dataset for each application. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. To address this, we introduce a novel task, LOVM: **L**anguage-**O**nly **V**ision **M**odel Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We also present an extensive LOVM benchmark consisting of ground-truth evaluations of 23 pre-trained VLMs on 35 datasets, enabling effective ranking and performance prediction of VLMs. Our code, full paper, and dataset are available at https://github.com/orrzohar/LOVM.
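
To make the task setup concrete, the sketch below outlines one possible interface for a LOVM method in Python. It is a minimal illustration only: the names `TaskDescription`, `LOVMMethod`, and `select_best_model` are assumptions for this sketch and do not reflect the benchmark repository's actual API. The point it shows is the task contract from the abstract: a method receives only a text description of the application (no images or labels) plus candidate model identifiers, and must output per-model performance predictions and a ranking.

```python
"""Hypothetical sketch of the LOVM task interface (illustrative, not the repo's API)."""
from dataclasses import dataclass
from typing import Protocol


@dataclass
class TaskDescription:
    # Free-form text describing the downstream application
    # (domain, expected imagery), plus its class names.
    text: str
    class_names: list[str]


class LOVMMethod(Protocol):
    def predict(
        self, task: TaskDescription, candidate_models: list[str]
    ) -> dict[str, float]:
        """Return a predicted accuracy for each candidate VLM,
        using only the text description of the task."""
        ...


def select_best_model(
    method: LOVMMethod, task: TaskDescription, candidates: list[str]
) -> list[tuple[str, float]]:
    # Rank candidates by predicted performance, best first.
    scores = method.predict(task, candidates)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under this framing, the benchmark's ground-truth evaluations are used only to score a method's predicted rankings and accuracies after the fact, never as an input to the method itself.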
Submission Number: 45