Keywords: benchmark, vision language models, multimodal models
TL;DR: We introduce a benchmark that evaluates how well models compare visual data. Our evaluations highlight limitations of state-of-the-art models and point toward ways to improve them.
Abstract: Understanding how effectively large vision language models (VLMs) compare visual inputs is crucial across numerous applications, yet this fundamental capability remains insufficiently assessed. While VLMs are increasingly deployed for tasks requiring comparative judgment, including automated evaluation, re-ranking, and retrieval-augmented generation, no systematic framework exists to measure their performance in these scenarios. We present PairBench, a simple framework that evaluates VLMs as customizable similarity tools using widely available image datasets. Our approach introduces four key metrics for reliable comparison: alignment with scores derived from human annotations, consistency across pair ordering, distribution smoothness, and controllability through prompting. Our analysis reveals that no model consistently excels across all metrics, with each demonstrating distinct strengths and weaknesses. Most concerning is the widespread inability of VLMs to maintain symmetric similarity scores when the order of a pair is reversed. Interestingly, we demonstrate that performance on our benchmark correlates strongly with popular benchmarks used for complex reasoning tasks, while providing additional insight into controllability, smoothness, and ordering. This makes PairBench a unique and comprehensive framework for assessing VLMs as automatic evaluators, as well as an efficient predictor of model capabilities on more complex tasks.
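To illustrate the kind of pair-ordering check the abstract describes, here is a minimal sketch of a symmetry metric for a similarity judge. The `score_pair` callable is a hypothetical stand-in for querying a VLM with two inputs; it is not the authors' implementation, and the exact metric used in PairBench may differ.

```python
# Sketch of a pair-order symmetry check: a symmetric similarity judge should
# assign the same score to (a, b) and (b, a). The scoring function is assumed,
# not taken from the paper.
from typing import Callable, Sequence, Tuple


def symmetry_gap(
    pairs: Sequence[Tuple[str, str]],
    score_pair: Callable[[str, str], float],
) -> float:
    """Average absolute difference between s(a, b) and s(b, a).

    Returns 0.0 for a perfectly symmetric judge; larger values indicate
    order-dependent scoring.
    """
    gaps = []
    for a, b in pairs:
        forward = score_pair(a, b)   # e.g., prompt the model with (a, b)
        backward = score_pair(b, a)  # same prompt with the inputs swapped
        gaps.append(abs(forward - backward))
    return sum(gaps) / len(gaps) if gaps else 0.0


if __name__ == "__main__":
    # Toy example with a deliberately asymmetric scorer.
    fake_scores = {("x", "y"): 0.8, ("y", "x"): 0.6}
    print(symmetry_gap([("x", "y")], lambda a, b: fake_scores[(a, b)]))  # 0.2
```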
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 20273