BagelScore: Visual-Language Evaluation Made Easy

ICLR 2026 Conference Submission 24796 Authors

Published: 20 Sept 2025 (modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: Multimodal Learning, Image Editing
Abstract: Evaluation remains a fundamental challenge in multimodal learning. Existing metrics such as CLIPScore, LPIPS, and FID reduce assessment to embedding similarity or perceptual distance, an approach that systematically fails to capture semantic correctness or editing plausibility, while GPT-based scoring remains subjective and inconsistent. We argue that the emergence of bottleneck-free unified multimodal models enables a new evaluation paradigm: their internal reasoning and generative dynamics can serve as principled evaluation signals. Building on BAGEL, we propose two complementary metrics. BagelScore targets image understanding and image-text matching: by directly evaluating the semantic alignment between images and captions with the unified model's reasoning capabilities, it outperforms CLIPScore, LPIPS, FID, and GPT-based heuristics. EditingScore, the first metric designed specifically for assessing image editing quality, quantifies the difficulty of learning the editing transformation in the latent space of a generative model; we validate it on Edit-1K, the first benchmark dataset created for image editing quality evaluation. Together, BagelScore and EditingScore provide a unified, reasoning-based paradigm for multimodal evaluation.
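The abstract gives no implementation details, but as a rough illustration of the kind of reasoning-based alignment scoring it describes, here is a minimal Python sketch: it scores image-caption alignment by the probability a multimodal model assigns to "yes" when asked whether the caption matches the image. The `UnifiedModel` interface, its method name, the prompt, and the `DummyModel` stub are all assumptions made for illustration; none of this is BAGEL's actual API or the paper's definition of BagelScore.

```python
# Hedged sketch of likelihood-based image-text alignment scoring.
# The interface below is hypothetical; it is NOT BAGEL's API and
# NOT the paper's BagelScore definition, only the general idea.

import math
from typing import Protocol


class UnifiedModel(Protocol):
    """Assumed interface: log-probabilities the model assigns to each
    candidate answer, conditioned on an image and a text prompt."""

    def answer_logprobs(
        self, image: object, prompt: str, candidates: list[str]
    ) -> list[float]: ...


def alignment_score(model: UnifiedModel, image: object, caption: str) -> float:
    """Score alignment as P("yes" | image, question), normalized over
    a yes/no candidate set so the result lies in [0, 1]."""
    prompt = (
        f'Does the caption "{caption}" accurately describe this image? '
        "Answer yes or no."
    )
    log_yes, log_no = model.answer_logprobs(image, prompt, ["yes", "no"])
    # Two-way softmax over the candidate answers.
    return math.exp(log_yes) / (math.exp(log_yes) + math.exp(log_no))


class DummyModel:
    """Stand-in model so the sketch runs end to end; returns fixed
    log-probs instead of querying a real multimodal model."""

    def answer_logprobs(self, image, prompt, candidates):
        return [-0.2, -1.8]  # pretend the model leans toward "yes"


if __name__ == "__main__":
    score = alignment_score(DummyModel(), image=None, caption="a dog on a beach")
    print(f"alignment score: {score:.3f}")  # ~0.832 with the dummy log-probs
```

In a real setting the stub would be replaced by an actual unified multimodal model; the normalization over a fixed candidate set is one common way to turn raw answer likelihoods into a comparable score.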
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24796