MoMentS: A Comprehensive Multimodal Benchmark for Theory of Mind

ACL ARR 2025 May Submission 874 Authors

15 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Understanding Theory of Mind (ToM) is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI’s multimodal understanding of human behavior.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, human behavior analysis, multimodal QA, corpus creation, benchmarking
Contribution Types: Data resources
Languages Studied: English
Submission Number: 874