ViDi: A Benchmark for the Identification and Captioning of Visual Differences in Image Pairs

ACL ARR 2025 February Submission180 Authors

04 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · Readers: Everyone · License: CC BY 4.0
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have brought forward capabilities such as handling multiple images or engaging in multi-turn conversations involving images. In this paper, we present a challenging \textit{spot the differences} benchmark for evaluating MLLMs, geared toward the identification and captioning of Visual Differences (ViDi)\footnote{The dataset is available at \url{https://anonymous.4open.science/r/ViDi-FC45/}}. The benchmark consists of a test-only dataset of 200 image pairs, each accompanied by human-annotated descriptions of the visually noticeable differences. ViDi goes beyond the identification of single differences, challenging models to articulate changes in natural language while pinpointing the subject of each difference and its absolute or relative location. Empirical results reveal that MLLMs are still at an early stage of development, showing limited performance on \textit{spot the differences} tasks.
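To make the annotation structure described above concrete, the sketch below shows one way such a dataset could be represented and iterated over for evaluation. The field names, file layout, and loader are illustrative assumptions, not the released format; the actual data is defined by the repository linked in the abstract.

```python
# Hypothetical sketch of a ViDi-style record and evaluation loop.
# File paths and JSON field names below are assumptions for illustration only.
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ViDiPair:
    image_a: Path           # first image of the pair
    image_b: Path           # second image of the pair
    differences: list[str]  # human-written descriptions of the visible differences


def load_vidi(annotation_file: Path, image_dir: Path) -> list[ViDiPair]:
    """Load annotated image pairs from a JSON annotation file (assumed layout)."""
    records = json.loads(annotation_file.read_text())
    return [
        ViDiPair(
            image_a=image_dir / r["image_a"],
            image_b=image_dir / r["image_b"],
            differences=r["differences"],
        )
        for r in records
    ]


if __name__ == "__main__":
    pairs = load_vidi(Path("vidi/annotations.json"), Path("vidi/images"))
    for pair in pairs:
        # An MLLM under evaluation would receive both images and be asked to
        # describe the differences; its output would then be compared against
        # pair.differences (e.g., with captioning metrics or human judgment).
        print(pair.image_a.name, pair.image_b.name, len(pair.differences))
```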
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality; evaluation methodologies; reasoning; vision question answering
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 180