Can We Predict Performance of Large Models across Vision-Language Tasks?

Published: 01 May 2025 | Last Modified: 18 Jun 2025 | ICML 2025 poster | CC BY 4.0
TL;DR: A comprehensive evaluation of 108 LVLMs on 36 benchmarks, and a framework for predicting LVLM performance using PMF with MCMC, with applications such as uncertainty-guided ordering of evaluations.
Abstract: Evaluating large vision-language models (LVLMs) is very expensive due to the high computational cost and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer the unknown ones. In this study, we propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks. We first formulate performance prediction as a matrix completion task. Specifically, we construct a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset. By applying probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC), we can complete the performance matrix, i.e., predict the unknown scores. Additionally, we estimate the uncertainty of each prediction from the MCMC samples. Practitioners can evaluate their models on untested tasks with higher uncertainty first, which quickly reduces prediction error. We further introduce several improvements to enhance PMF for scenarios with sparse observed performance scores. Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data. Our code is available at https://github.com/Qinyu-Allen-Zhao/CrossPred-LVLM.
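To make the formulation concrete, below is a minimal sketch of completing a sparse performance matrix with PMF and MCMC. It is not the released CrossPred-LVLM code: it assumes PyMC (whose default NUTS sampler stands in for the paper's MCMC scheme), a small synthetic score matrix, and standard Gaussian priors on the latent model and dataset factors.

```python
# Minimal PMF-with-MCMC sketch for completing a sparse model-by-dataset
# performance matrix. NOT the authors' implementation; a toy illustration
# assuming PyMC and synthetic data.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
M, N, K = 8, 12, 3                       # models, datasets, latent dimension
true_U = rng.normal(size=(M, K))
true_V = rng.normal(size=(N, K))
R_full = true_U @ true_V.T               # hypothetical "true" score matrix
mask = rng.random((M, N)) < 0.5          # observe only ~50% of the entries
rows, cols = np.nonzero(mask)

with pm.Model() as pmf:
    U = pm.Normal("U", mu=0.0, sigma=1.0, shape=(M, K))   # model embeddings
    V = pm.Normal("V", mu=0.0, sigma=1.0, shape=(N, K))   # dataset embeddings
    sigma = pm.HalfNormal("sigma", sigma=1.0)              # observation noise
    R_hat = pm.Deterministic("R_hat", U @ V.T)             # completed matrix
    pm.Normal("obs", mu=R_hat[rows, cols], sigma=sigma,
              observed=R_full[rows, cols])
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior mean predicts the missing scores; posterior std gives the
# per-entry uncertainty used to decide which evaluations to run next.
post = idata.posterior["R_hat"]
pred_mean = post.mean(dim=("chain", "draw")).values
pred_std = post.std(dim=("chain", "draw")).values
print("Most uncertain untested entry:",
      np.unravel_index(np.argmax(np.where(mask, -np.inf, pred_std)), (M, N)))
```

The posterior spread over each unobserved entry is what enables the uncertainty-guided evaluation described in the abstract: entries with the largest posterior standard deviation are the model-task pairs worth testing first.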
Lay Summary: Evaluating how well large vision-language models (LVLMs) perform on a wide range of tasks, like answering questions about images or describing scenes, can be extremely expensive. Each test on these large-scale models requires time, money, and computing resources. But do we really need to test every model-task pair? In our study, we show that if we already know how a model performs on some tasks, we can predict how it might perform on others, using a mathematical technique called probabilistic matrix factorization. Even better, our method can also estimate how confident it is in its predictions. For example, if our method is uncertain about GPT-4's performance on 3D understanding but confident about LLaVA's performance on object recognition, we can prioritize evaluating GPT-4 on the 3D task when our resources are limited. We hope our framework can help to develop and improve LVLMs more efficiently. You can explore our code here: https://github.com/Qinyu-Allen-Zhao/CrossPred-LVLM.
Link To Code: https://github.com/Qinyu-Allen-Zhao/CrossPred-LVLM
Primary Area: Deep Learning->Large Language Models
Keywords: Large Vision-Language Models (LVLMs), Benchmarking, Probabilistic Matrix Factorization (PMF), Markov Chain Monte Carlo (MCMC), Active Evaluation
Submission Number: 8154