HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding

06 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: MLLM Benchmark, Video Understanding
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address these limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-the-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 minutes) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths. We evaluate several advanced open-source MLLMs on HV-MMBench. While models excel at closed-form tasks, their performance drops sharply on open-ended generation, revealing a reliance on shallow patterns rather than genuine reasoning; cloze and open-ended formats thus better expose the reasoning challenges in human behavior understanding. By spanning diverse tasks and paradigms, HV-MMBench systematically reveals these limitations and facilitates MLLM development.
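The abstract does not specify the per-format scoring protocol; below is a minimal sketch of how a mixed-format benchmark item of this kind might be scored. All field names, file paths, and thresholds are illustrative assumptions, not the paper's actual schema or metrics (the authors note that open-ended answers are judged with more sophisticated evaluation metrics than simple string similarity).

```python
# Hypothetical sketch of scoring one HV-MMBench-style item by question format.
# Field names and scoring rules are assumptions for illustration only.
from difflib import SequenceMatcher


def score_item(item: dict, prediction: str) -> float:
    """Return a score in [0, 1] for a model prediction on a single item."""
    fmt = item["format"]
    answer = item["answer"].strip().lower()
    pred = prediction.strip().lower()

    if fmt in ("multiple-choice", "true/false"):
        # Closed-form formats: exact match against the gold option/label.
        return float(pred == answer)
    if fmt == "fill-in-the-blank":
        # Cloze format: accept the gold span anywhere in the prediction.
        return float(answer in pred)
    if fmt == "open-ended":
        # Open-ended: crude textual-similarity proxy; real benchmarks
        # typically rely on an LLM judge or task-specific metrics instead.
        return SequenceMatcher(None, pred, answer).ratio()
    raise ValueError(f"unknown question format: {fmt}")


if __name__ == "__main__":
    sample = {
        "video": "clips/example_0042.mp4",      # hypothetical path
        "duration_s": 18.0,                     # short-term clip
        "task": "emotion recognition",          # one of the 15 dimensions
        "format": "multiple-choice",
        "question": "What emotion does the person at the counter show?",
        "options": ["A. anger", "B. joy", "C. surprise", "D. fear"],
        "answer": "C",
    }
    print(score_item(sample, "C"))  # -> 1.0
```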
Primary Area: datasets and benchmarks
Submission Number: 2636