Human-Centric Framework for Large Multimodal Models Evaluation

Shaina Raza; Aravind Narayanan; Vahid Reza Khazaie; Ashmal Vayani; Mukund Sayeeganesh Chettiar; Deval Pandya

Published: 24 Sept 2025, Last Modified: 24 Sept 2025. NeurIPS 2025 LLM Evaluation Workshop Poster. CC BY 4.0.

Keywords: human centric evaluation, LLM, multimodals

TL;DR: Human-centric evaluation of large multimodal models

Abstract: Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding, but lack rigorous evaluation for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a novel benchmark of 32,000 real-world image-question pairs and an evaluation suite. Labels are generated via an AI-assisted pipeline and validated by experts. HumaniBench assesses LMMs across seven key alignment principles (fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality) through open-ended and closed-ended VQA tasks. Grounded in AI ethics and real-world needs, these principles provide a holistic lens for assessing societal impact. Benchmarking results across different LMMs show that proprietary models generally lead in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Most models struggle to balance accuracy with ethical and inclusive behavior. HumaniBench offers a rigorous testbed to diagnose limitations and promote responsible LMM development. Code and data are available for reproducibility.

Submission Number: 100
