Keywords: Benchmark, Multimodal Large Language Models
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks.
However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that account for both fine-grained, human-oriented perception and higher-level causal reasoning.
Constructing such a high-quality benchmark is difficult, given the physical complexity of the human body and the difficulty of annotating its fine-grained structures.
In this paper, we propose Human-MME, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with existing benchmarks, our work offers three key features:
**(1) Diversity in human scenes**, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage.
**(2) Progressive and diverse evaluation dimensions**, assessing human-centric activities progressively, from fine-grained, human-oriented perception to higher-level multi-target and causal reasoning, across eight dimensions comprising 19,945 real-world image-question pairs and an accompanying evaluation suite.
**(3) High-quality annotations with rich data paradigms**, built with an automated annotation pipeline and a human-annotation platform that support rigorous manual labeling by expert annotators, enabling precise and reliable model assessment. Our benchmark extends single-person, single-image understanding to multi-person, multi-image mutual understanding through choice, short-answer, grounding, ranking, and judgment question components, as well as complex question-answer pairs that combine them. Extensive experiments on 20 state-of-the-art MLLMs expose their limitations and guide future MLLM research toward better human-centric image understanding and reasoning. Data and code are available at [https://github.com/Yuan-Hou/Human-MME](https://github.com/Yuan-Hou/Human-MME).
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 3048