Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics

TMLR Paper 6007 Authors

26 Sept 2025 (modified: 24 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Human perception of similarity across uni- and multi-modal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. While general-purpose vision-language models (VLMs) like CLIP and large multi-modal models (LMMs) can serve as zero-shot perceptual metrics, they are not explicitly trained for this task. As a result, recent efforts have developed specialized models for narrow perceptual tasks. However, the extent to which these metrics align with human perception remains unclear. To address this, we introduce UniSim-Bench, a benchmark covering seven multi-modal perceptual similarity tasks across 25 datasets. Our evaluation reveals that models fine-tuned on a specific dataset struggle to generalize to unseen datasets within the same task or to related perceptual tasks. As a first step towards a unified multi-task perceptual similarity metric, we fine-tune both encoder-based and generative vision-language models on a subset of UniSim-Bench tasks. This approach achieves the highest average performance and, in some cases, surpasses task-specific models. Our comparative analysis shows that encoder-based VLMs generalize better than generative models when used as perceptual metrics. However, these models still struggle with unseen tasks, underscoring the challenge of developing a robust, unified metric that accurately captures human notions of similarity.
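To illustrate the zero-shot setting the abstract refers to, below is a minimal sketch (not the authors' code) of using CLIP image embeddings as a perceptual metric in a two-alternative forced choice (2AFC) setup: given a reference image and two candidates, the candidate whose embedding is closer to the reference is selected. The model checkpoint and the 2AFC formulation here are illustrative assumptions, not details taken from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()


@torch.no_grad()
def embed(image: Image.Image) -> torch.Tensor:
    """Return a unit-normalized CLIP image embedding."""
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)


def more_similar(reference: Image.Image, cand_a: Image.Image, cand_b: Image.Image) -> str:
    """Zero-shot 2AFC: pick the candidate closer to the reference in CLIP space."""
    ref, a, b = embed(reference), embed(cand_a), embed(cand_b)
    sim_a = (ref @ a.T).item()  # cosine similarity (embeddings are normalized)
    sim_b = (ref @ b.T).item()
    return "A" if sim_a >= sim_b else "B"
```

Specialized perceptual metrics and the fine-tuned models studied in the paper replace or adapt this off-the-shelf embedding, which is what UniSim-Bench is designed to evaluate.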
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lei_Wang13
Submission Number: 6007