HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

20 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Automated Data Synthesis, Multiple-Choice QA Synthesis, Human-Centric Video Understanding, Video Benchmarks, Multimodal Large Language Models
TL;DR: HumanVBench pioneers rigorous human-centric video evaluation with a scalable, automated pipeline that generates 16 fine-grained tasks, drastically reducing manual labor and offering a generalizable benchmark-creation framework for MLLM assessment.
Abstract: Evaluating the nuanced human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs) remains a significant challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal alignment. We introduce HumanVBench, a comprehensive video benchmark designed to rigorously probe these capabilities across 16 fine-grained tasks. A cornerstone of our work is a novel and scalable benchmark construction methodology, featuring two automated pipelines that synthesize high-quality video annotations and challenging multiple-choice questions with minimal human labor. By leveraging state-of-the-art models for annotation and systematically converting model-induced errors into plausible distractors, our framework provides a generalizable "machine" for creating nuanced evaluation suites. Our extensive evaluation of 27 leading MLLMs on HumanVBench reveals critical deficiencies, particularly in perceiving subtle emotions and aligning speech with visual cues, with even top proprietary models falling short of human performance. We open-source HumanVBench and our synthesis pipelines to catalyze the development of more socially intelligent and capable video MLLMs.
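The abstract's distractor-construction idea (turning model-induced errors into plausible wrong options for multiple-choice questions) can be illustrated with the minimal sketch below. All names, data structures, and the filtering logic are illustrative assumptions for exposition, not the paper's released pipeline.

```python
import random
from dataclasses import dataclass, field

@dataclass
class MCQItem:
    """One multiple-choice item built from a video annotation (hypothetical structure)."""
    question: str
    correct: str
    distractors: list = field(default_factory=list)

def harvest_model_errors(correct, candidate_answers):
    """Keep wrong answers produced by answering models as distractor candidates.

    `candidate_answers` is assumed to be a list of free-form answers collected by
    querying several MLLMs on the same question; this interface is an assumption."""
    errors = []
    for ans in candidate_answers:
        ans = ans.strip()
        # Discard empty answers and answers that merely restate the correct option.
        if ans and ans.lower() != correct.lower():
            errors.append(ans)
    return errors

def build_mcq(question, correct, candidate_answers, n_options=4, seed=0):
    """Assemble a multiple-choice item whose distractors come from model errors."""
    rng = random.Random(seed)
    candidates = harvest_model_errors(correct, candidate_answers)
    # Deduplicate while preserving order, then keep at most n_options - 1 distractors.
    seen, unique = set(), []
    for d in candidates:
        if d.lower() not in seen:
            seen.add(d.lower())
            unique.append(d)
    distractors = unique[: n_options - 1]
    options = distractors + [correct]
    rng.shuffle(options)
    return MCQItem(question=question, correct=correct, distractors=distractors), options

# Toy usage with made-up answers.
item, options = build_mcq(
    question="What emotion does the speaker convey?",
    correct="suppressed frustration",
    candidate_answers=["calm contentment", "suppressed frustration", "mild amusement"],
)
print(options)
```

In practice such a step would sit downstream of the automated annotation pipeline and upstream of any human or model-based quality filtering; the sketch only shows why model errors make natural, hard-to-dismiss distractors.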
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 23192