MoMentS: A Comprehensive Multimodal Benchmark for Theory of Mind

ACL ARR 2025 May Submission 874 Authors

15 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Understanding Theory of Mind (ToM) is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI’s multimodal understanding of human behavior.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, human behavior analysis, multimodal QA, corpus creation, benchmarking
Contribution Types: Data resources
Languages Studied: English
Submission Number: 874