Keywords: Large Multimodal Models, Graphical Perception, Evaluation
TL;DR: Based on graphical perception theory, we propose an automated framework to evaluate large multimodal models, including GPT-4o, and identify their limitations at the chart type, visual element, and pixel levels.
Abstract: Despite the promising results of large multimodal models (LMMs) in various vision-language tasks, recent benchmarks reveal that these models can struggle with low-level chart perception tasks that require precision.
However, because existing benchmarks primarily focus on end tasks that evaluate models' knowledge and reasoning abilities together, they provide limited fine-grained insight into how models' perception abilities affect their performance on chart tasks.
To address this gap, we leverage *the theory of graphical perception*, an approach used to study how humans decode visual information encoded in charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities on charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types.
We apply our framework to evaluate the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross-reference values within a chart.
These insights provide guidance for future improvements in the perception abilities of LMMs.
The evaluation framework and labeled data will be publicly available upon acceptance.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8075