Ranked from Within: Ranking Large Multimodal Models Without Labels

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Can the relative performance of a pre-trained large multimodal model (LMM) be predicted without access to labels? As LMMs proliferate, it becomes increasingly important to develop efficient ways to choose between them when faced with new data or tasks. The usual approach does the equivalent of giving the models an exam and marking them. We opt to avoid marking and the associated labor of determining the ground-truth answers. Instead, we explore other signals elicited from the models and assess how well they know their own limits, evaluating the effectiveness of these signals for unsupervised model ranking. We evaluate 47 state-of-the-art LMMs (e.g., LLaVA) across 9 visual question answering benchmarks, analyzing how well uncertainty-based metrics can predict relative model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
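To make the idea concrete, below is a minimal sketch of ranking models by a softmax-derived uncertainty signal on unlabeled questions. The specific score used here (maximum softmax probability averaged over questions), the per-question answer-option logits, and the function names are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over answer-option logits.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_score(logits_per_question):
    # One hypothetical uncertainty-based signal: average maximum softmax
    # probability over a set of unlabeled questions. Higher means the model
    # is, on average, more certain of its own answers.
    probs = softmax(np.asarray(logits_per_question), axis=-1)
    return probs.max(axis=-1).mean()

def rank_models(model_logits):
    # model_logits maps a model name to an array of shape
    # (num_questions, num_answer_options) produced on the SAME unlabeled set.
    # Returns models sorted from most to least confident.
    scores = {name: confidence_score(logits) for name, logits in model_logits.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with two hypothetical models on three unlabeled questions.
rng = np.random.default_rng(0)
logits = {
    "model_a": rng.normal(size=(3, 4)) * 3.0,  # sharper logits -> higher confidence
    "model_b": rng.normal(size=(3, 4)),
}
print(rank_models(logits))
```

Under the paper's finding, a model's average confidence on the unlabeled target data serves as a proxy for its relative accuracy, so the resulting ordering can guide model selection without any ground-truth answers.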
Lay Summary: Large AI models that understand both images and text are being used more and more, but it can be hard to know which one works best for a new task — especially when we don’t have labeled data to test them. Usually, people compare models by checking their answers against correct ones. But finding or creating those correct answers takes time and effort. And if we only have the questions — with no answers — there’s no easy way to tell which model to trust. Should we just pick one at random? In this work, we look at a different approach. Instead of checking answers, we see how confident each model is in its responses and use that to guess how well it might perform. We tested this idea using 47 of the latest models on nine different tasks. We found that a model’s confidence — measured in a specific way — can often predict how good it is, even without knowing the correct answers. This could help people quickly choose the right model for their needs without spending time labeling data.
Primary Area: Social Aspects->Robustness
Keywords: Large Multimodal Models, visual question answering
Submission Number: 9130