MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models

ACL ARR 2025 May Submission3050 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0

Abstract: Current multi-modal benchmarks primarily focus on facts within individual images. However, they overlook the associative relations among multiple images, which require commonsense reasoning grounded in associated knowledge at different granularities (i.e., "image" and "entity") and the ability to perceive image order. Therefore, we propose the multi-image relation association task and a meticulously curated Multi-granularity Multi-image Relational Association (MMRA) benchmark comprising 1,024 samples. To systematically evaluate current LVLMs, we establish an associational relation system among images that contains 11 subtasks (e.g., UsageSimilarity and SubEvent) at two granularity levels (i.e., "image" and "entity"), based on the relations in ConceptNet. Our experiments reveal that entity-level multi-image perception tasks pose a greater challenge for LVLMs than image-level tasks. Moreover, LVLMs perform poorly on spatial-related tasks, indicating that they have limited spatial awareness. Furthermore, we find that LVLMs' image order perception capability is relatively poor, and we design a method that significantly improves it, which demonstrates that the majority of current LVLMs do not adequately consider image order perception during pre-training.

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: LVLMs, multiple images

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources

Languages Studied: English

Submission Number: 3050
