InfiMM-Eval: Complex Open-ended Reasoning Evaluation for Multi-modal Large Language Models

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Multi-modal Large Language Models (MLLMs) are increasingly prominent due to their superior reasoning abilities, which enable them to excel at complex tasks. Prevailing benchmarks for multi-modal reasoning assess MLLMs through yes/no or multiple-choice questions, which by design can introduce position bias and overlook the intermediate reasoning process, rendering the results less convincing. To this end, we systematically categorize reasoning tasks into deductive, abductive, and analogical reasoning, and introduce InfiMM-Eval, a manually curated benchmark featuring 279 diverse and nuanced reasoning questions across these categories. The questions are fully open-ended to better reflect the characteristics of generative models. To mitigate the challenge of answering complex reasoning questions, we encourage models to generate intermediate reasoning steps. These steps are incorporated into the evaluation protocol to reduce bias towards plausible guesses or responses that lack definitive answers, while facilitating the assessment of more nuanced reasoning skills. This evaluation scheme closely resembles how humans grade exams in real-world settings, enabling a more reliable assessment. We evaluate a large selection of trending MLLMs and reveal the discrepancies in reasoning abilities between open-source and proprietary MLLMs. Additionally, we conduct a comprehensive analysis of three reasoning-related factors, highlighting potential directions for further research on improving the reasoning abilities of MLLMs.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Preprint Status: There is a non-anonymous preprint (URL specified in the next question).
A1: yes
A1 Elaboration For Yes Or No: Section 6
A2: yes
A2 Elaboration For Yes Or No: Section 7
A3: yes
A3 Elaboration For Yes Or No: Abstract and Section 1
B: yes
B1: yes
B1 Elaboration For Yes Or No: Reference Section
B2: n/a
B2 Elaboration For Yes Or No: The license and terms of use are released along with the dataset
B3: yes
B3 Elaboration For Yes Or No: All datasets and models are discussed and briefly introduced in Section 2
B4: yes
B4 Elaboration For Yes Or No: Section 3
B5: n/a
B5 Elaboration For Yes Or No: All are discussed in the related work section
B6: yes
B6 Elaboration For Yes Or No: Section 3
C: yes
C1: yes
C1 Elaboration For Yes Or No: Section 4
C2: yes
C2 Elaboration For Yes Or No: Section 4
C3: yes
C3 Elaboration For Yes Or No: Section 4
C4: yes
C4 Elaboration For Yes Or No: Section 4
D: yes
D1: yes
D1 Elaboration For Yes Or No: Section 3
D2: yes
D2 Elaboration For Yes Or No: Section 3
D3: yes
D3 Elaboration For Yes Or No: Section 3
D4: yes
D4 Elaboration For Yes Or No: Section 3
D5: yes
D5 Elaboration For Yes Or No: Section 3
E: yes
E1: yes
E1 Elaboration For Yes Or No: Section 4