MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models

ACL ARR 2024 December Submission 1227 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · License: CC BY 4.0
Abstract: Current multi-modal benchmarks primarily focus on facts or specific topic-related knowledge within individual images. However, they overlook the associative relations between multiple images, which require identifying and analyzing similarities among entities or content present in different images. Therefore, we propose the multi-image relational association task and a meticulously curated \textbf{M}ulti-granularity \textbf{M}ulti-image \textbf{R}elational \textbf{A}ssociation (\textbf{MMRA}) benchmark, comprising \textbf{1,024} samples. To evaluate current LVLMs systematically and comprehensively, we establish an associational relation system among images that contains \textbf{11 subtasks} (e.g., UsageSimilarity, SubEvent) at two granularity levels (i.e., ``\textbf{image}'' and ``\textbf{entity}'') according to the relations in ConceptNet. Our experiments reveal that entity-level multi-image perception tasks pose a greater challenge for LVLMs than image-level tasks. Moreover, LVLMs perform poorly on spatial-related tasks, indicating that their spatial awareness is limited. Finally, we explore the ability of LVLMs to perceive image sequences, and our experiments show that most current LVLMs do not adequately model image sequences during pre-training.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Multi-image Association, benchmark
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 1227