InfiMM-Eval: Complex Open-ended Reasoning Evaluation for Multi-modal Large Language Models

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Multi-modal Large Language Models (MLLMs) are increasingly prominent due to their superior reasoning abilities, which enable them to excel at complex tasks. Prevailing benchmarks for multi-modal reasoning assess MLLMs through yes/no or multiple-choice questions, which by design can introduce position bias and overlook the intermediate reasoning process, rendering the results less convincing. To this end, we systematically categorize reasoning tasks into deductive, abductive, and analogical reasoning, and introduce InfiMM-Eval, a manually curated benchmark featuring 279 diverse and nuanced reasoning questions across these categories. The questions are fully open-ended to better reflect the characteristics of generative models. To mitigate the challenge of answering complex reasoning questions, we encourage models to generate intermediate reasoning steps. These steps are incorporated into the evaluation protocol to reduce bias towards plausible guesses or responses that lack definitive answers, while facilitating the assessment of more nuanced reasoning skills. This evaluation scheme closely resembles how humans grade exams in real-world settings, enabling a more reliable assessment. We evaluate a large selection of trending MLLMs and reveal the discrepancies in reasoning abilities between open-source and proprietary MLLMs. Additionally, we conduct a comprehensive analysis of three reasoning-related factors, highlighting potential directions for further research on improving the reasoning abilities of MLLMs.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Preprint Status: There is a non-anonymous preprint (URL specified in the next question).
A1: yes
A1 Elaboration For Yes Or No: Section 6
A2: yes
A2 Elaboration For Yes Or No: Section 7
A3: yes
A3 Elaboration For Yes Or No: Abstract and Section 1
B: yes
B1: yes
B1 Elaboration For Yes Or No: Reference Section
B2: n/a
B2 Elaboration For Yes Or No: The license and terms of use are released along with the dataset
B3: yes
B3 Elaboration For Yes Or No: All datasets and models are discussed and briefly introduced in Section 2
B4: yes
B4 Elaboration For Yes Or No: Section 3
B5: n/a
B5 Elaboration For Yes Or No: All are discussed in the related work section
B6: yes
B6 Elaboration For Yes Or No: Section 3
C: yes
C1: yes
C1 Elaboration For Yes Or No: Section 4
C2: yes
C2 Elaboration For Yes Or No: Section 4
C3: yes
C3 Elaboration For Yes Or No: Section 4
C4: yes
C4 Elaboration For Yes Or No: Section 4
D: yes
D1: yes
D1 Elaboration For Yes Or No: Section 3
D2: yes
D2 Elaboration For Yes Or No: Section 3
D3: yes
D3 Elaboration For Yes Or No: Section 3
D4: yes
D4 Elaboration For Yes Or No: Section 3
D5: yes
D5 Elaboration For Yes Or No: Section 3
E: yes
E1: yes
E1 Elaboration For Yes Or No: Section 4