Towards an Exhaustive Evaluation of Vision-Language Foundation Models

Emmanuelle Salin, Stéphane Ayache, Benoît Favre

Published: 2023, Last Modified: 17 Mar 2026 | ICCV (Workshops) 2023 | CC BY-SA 4.0
Abstract: Vision-language foundation models have improved considerably in performance over the last few years. However, there is still a lack of comprehensive evaluation methods able to clearly explain their performance. We argue that a more systematic approach to foundation model evaluation would benefit their use in real-world applications. In particular, we think that these models should be evaluated on a broad range of precise capabilities, in order to bring awareness to the breadth of their scope and their potential weaknesses. To that end, we propose a methodology to build a taxonomy of multimodal capabilities for vision-language foundation models. The proposed taxonomy is intended as a first step towards an exhaustive evaluation of vision-language foundation models.