Towards Holistic Evaluation of MLLMs for Embodied Decision-Making in Complex Human-Centered Situations
Abstract: Multimodal Large Language Models (MLLMs) show promise for enabling embodied agents to operate meaningfully in complex, human-centered environments. Yet evaluating their capacity for nuanced, human-like reasoning and decision-making remains challenging. We therefore introduce HRDBench, a cognitively grounded benchmark for evaluating Human-centered Embodied Reasoning and Decision-making in MLLMs. HRDBench consists of 1,113 real-world situations paired with 6,126 multiple-choice questions targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a holistic framework for assessing a model's ability to perceive, reason, and act in socially meaningful ways. We evaluate state-of-the-art commercial and open-source models on HRDBench, revealing distinct performance patterns and highlighting significant challenges. Our in-depth analysis further offers insights into current model limitations and supports the development of MLLMs with more robust, context-aware, and socially adept embodied decision-making capabilities for real-world scenarios.
Paper Type: Long
Research Area: Human-Centered NLP
Research Area Keywords: human-centered evaluation, human-subject application-grounded evaluations, multimodality, benchmarking
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Keywords: human-centered decision-making, benchmarking
Submission Number: 2774