RPTS-Eval: Evaluating Large Vision-Language Models for Reasoning Process and Scoring with Tree

ACL ARR 2024 December Submission 2284 Authors

16 Dec 2024 (modified: 05 Feb 2025) · CC BY 4.0
Abstract: Large Vision-Language Models (LVLMs) excel at multimodal reasoning and have shown impressive performance across various multimodal benchmarks. However, most of these benchmarks evaluate models through multiple-choice or short-answer formats, which do not take the reasoning process into account. The few benchmarks that do assess the reasoning process tend to use overly simplistic methods and examine reasoning only when answers are incorrect, overlooking scenarios where flawed reasoning leads to a correct answer. They also ignore the impact of inter-modal relationships on reasoning. To address these issues, we propose RPTS-Eval, a benchmark focused on meticulously evaluating models' reasoning processes. RPTS-Eval comprises 374 images and 390 reasoning instances covering six types of vision-language capabilities. We also introduce a new evaluation metric, RPTS, that provides a fine-grained reflection of the reasoning process: it not only indicates the overall correctness of the reasoning but also pinpoints the specific step at which the model errs. We evaluate representative LVLMs (e.g., GPT-4o, Llava-Next), uncovering their limitations in multimodal reasoning and highlighting the differences between open-source and closed-source commercial LVLMs. We believe this benchmark will advance research in multimodal reasoning.
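The abstract does not define how RPTS is computed; as a purely illustrative sketch (the ReasoningStep class, its field names, and the scoring rule below are assumptions for exposition, not the paper's actual metric), a step-level score over a reasoning tree that reports both overall correctness and the first erroneous step might look like:

from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class ReasoningStep:
    # One node in a reasoning tree: a natural-language step plus a
    # judge's verdict on whether the step itself is correct.
    # (Hypothetical structure; not the paper's definition of RPTS.)
    description: str
    correct: bool
    children: list = field(default_factory=list)

def score_tree(root: ReasoningStep) -> Tuple[float, Optional[str]]:
    """Preorder walk: return (fraction of correct steps, description
    of the earliest erroneous step, or None if all steps are correct)."""
    steps = []
    def walk(node: ReasoningStep) -> None:
        steps.append(node)
        for child in node.children:
            walk(child)
    walk(root)
    n_correct = sum(s.correct for s in steps)
    first_error = next((s.description for s in steps if not s.correct), None)
    return n_correct / len(steps), first_error

# A flawed intermediate step that nevertheless reaches the right answer:
# the final answer is marked correct, yet the score exposes the bad step,
# the failure mode the abstract says answer-only benchmarks overlook.
tree = ReasoningStep("read the chart axes", True, [
    ReasoningStep("extract the 2019 value", False, [
        ReasoningStep("state the final answer", True),
    ]),
])
print(score_tree(tree))  # (0.666..., 'extract the 2019 value')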
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Chinese
Submission Number: 2284