Keywords: benchmark, vision language models, multimodal models
TL;DR: We introduce a benchmark that evaluates how well models compare visual data. Our evaluations highlight limitations of state-of-the-art models and point toward ways to improve them.
Abstract: Understanding how effectively large vision language models (VLMs) compare visual inputs is crucial across numerous applications, yet this fundamental capability remains insufficiently assessed. While VLMs are increasingly deployed for tasks requiring comparative judgment, including automated evaluation, re-ranking, and retrieval-augmented generation, no systematic framework exists to measure their performance in these scenarios. We present PairBench, a simple framework that evaluates VLMs as customizable similarity tools using widely available image datasets. Our approach introduces four key metrics for reliable comparison: alignment with scores derived from human annotations, consistency across pair ordering, distribution smoothness, and controllability through prompting. Our analysis reveals that no model consistently excels across all metrics, with each demonstrating distinct strengths and weaknesses. Most concerning is the widespread inability of VLMs to maintain symmetric similarity scores when the order of a pair is reversed. Interestingly, we demonstrate that performance on our benchmark correlates strongly with popular benchmarks used for complex reasoning tasks, while providing additional insight into controllability, smoothness, and ordering. This makes PairBench a unique and comprehensive framework for assessing VLMs as automatic evaluators, as well as an efficient predictor of model capabilities on more complex tasks.
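To illustrate the kind of pair-ordering check the abstract describes, here is a minimal sketch of a symmetry metric for a similarity judge. The `score_pair` callable is a hypothetical stand-in for querying a VLM with two inputs; it is not the authors' implementation, and the exact metric used in PairBench may differ.

```python
# Sketch of a pair-order symmetry check: a symmetric similarity judge should
# assign the same score to (a, b) and (b, a). The scoring function is assumed,
# not taken from the paper.
from typing import Callable, Sequence, Tuple


def symmetry_gap(
    pairs: Sequence[Tuple[str, str]],
    score_pair: Callable[[str, str], float],
) -> float:
    """Average absolute difference between s(a, b) and s(b, a).

    Returns 0.0 for a perfectly symmetric judge; larger values indicate
    order-dependent scoring.
    """
    gaps = []
    for a, b in pairs:
        forward = score_pair(a, b)   # e.g., prompt the model with (a, b)
        backward = score_pair(b, a)  # same prompt with the inputs swapped
        gaps.append(abs(forward - backward))
    return sum(gaps) / len(gaps) if gaps else 0.0


if __name__ == "__main__":
    # Toy example with a deliberately asymmetric scorer.
    fake_scores = {("x", "y"): 0.8, ("y", "x"): 0.6}
    print(symmetry_gap([("x", "y")], lambda a, b: fake_scores[(a, b)]))  # 0.2
```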
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 20273