Keywords: Multimodal Perceptual Metric, Vision-Language Model, Large Multimodal Model
Abstract: Human perception of similarity across uni- and multimodal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. While general-purpose vision-language models (VLMs) like CLIP and large multimodal models (LMMs) can serve as zero-shot perceptual metrics, they are not explicitly trained for this task. As a result, recent efforts have developed specialized models for narrow perceptual tasks. However, the extent to which these metrics align with human perception remains unclear. To address this, we introduce UniSim-Bench, a benchmark covering seven multimodal perceptual similarity tasks across 25 datasets. Our evaluation reveals that models fine-tuned on a specific dataset struggle to generalize to unseen datasets within the same task or to related perceptual tasks. As a first step towards a unified multi-task perceptual similarity metric, we fine-tune both encoder-based and generative vision-language models on a subset of UniSim-Bench tasks. This approach achieves the highest average performance and, in some cases, surpasses task-specific models, demonstrating the viability of a unified perceptual metric. Moreover, our comparative analysis shows that encoder-based VLMs generalize better as perceptual metrics than their generative counterparts.
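To illustrate the zero-shot setting the abstract refers to, the sketch below scores image similarity with an off-the-shelf CLIP encoder via Hugging Face `transformers`. The model checkpoint, the cosine-similarity scoring, and the 2AFC-style usage comment are illustrative assumptions, not the UniSim-Bench evaluation protocol or the fine-tuned models described in the paper.

```python
# Minimal sketch (assumed setup): CLIP image embeddings + cosine similarity
# as a zero-shot perceptual similarity score.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint for illustration
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clip_image_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity of the two images' CLIP embeddings (higher = more similar)."""
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    return float(feats[0] @ feats[1])

# Example 2AFC-style judgment: which candidate is perceptually closer to the reference?
# ref, cand_a, cand_b = Image.open("ref.png"), Image.open("a.png"), Image.open("b.png")
# preferred = "a" if clip_image_similarity(ref, cand_a) >= clip_image_similarity(ref, cand_b) else "b"
```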
Submission Number: 8