Keywords: Theory of Mind, Benchmark, Social Reasoning, Large Language Models, Reasoning
Abstract: As large language models (LLMs) become increasingly involved in human society, several studies have tried to evaluate their theory of mind (ToM) capability, that is, the ability to understand and reason about others' mental states and possible actions. However, these previous works simplify the ToM capability required in real social contexts. This simplification shows in three aspects: (1) most evaluations focus on a **static mental state** after several social scenarios, ignoring how mental states change across scenarios; (2) they mainly consider **independent mental states**, although in real life different kinds of mental states (beliefs, intentions, and emotions) and actions can influence one another; (3) there is an **absence of social settings and character profiles** in their evaluation, even though humans can effortlessly obtain and use this information in ToM reasoning. This omission can lead to underestimating the abilities of LLMs. This paper aims to evaluate LLMs' ToM capability in closer alignment with a realistic social context.
Accordingly, we propose a new benchmark, named **ToMValley**, which alleviates the limitations of previous works described above. Specifically, the benchmark is constructed with a four-step framework: social background determination, mental state sketch, social scenario design, and rule-based question generation. Overall, it contains 1100 social contexts and 78100 questions about characters' mental states (71 questions per context on average). The quality of the benchmark is manually verified. Additionally, we evaluate ten popular LLMs on **ToMValley**. Experimental results suggest that LLMs' performance falls significantly short of human levels, by 11%. Subsequent investigation indicates that LLMs are ineffective at interpreting changes in mental states across social scenarios. Furthermore, we observe that LLMs are unable to handle compositional questions that require multi-hop reasoning within the social context.
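To make the last step of the framework concrete, the following is a minimal sketch of what rule-based question generation over a sequence of scenarios could look like. It is an illustration under stated assumptions, not the ToMValley implementation: the `Scenario` schema, the `generate_questions` function, the two question templates, and the example data are all hypothetical. It only shows how per-scenario mental state annotations could yield both state questions and cross-scenario change questions.

```python
from dataclasses import dataclass

# Hypothetical schema: the field names and question templates below are
# illustrative assumptions, not the actual ToMValley format.
MENTAL_STATE_KINDS = ("belief", "intention", "emotion")


@dataclass
class Scenario:
    index: int
    states: dict  # mental state kind -> short description for one character


def generate_questions(character, scenarios):
    """Apply two fixed templates: per-scenario state and cross-scenario change."""
    questions = []
    # Template 1: ask about each annotated mental state in each scenario.
    for sc in scenarios:
        for kind in MENTAL_STATE_KINDS:
            if kind in sc.states:
                questions.append({
                    "type": "state",
                    "question": f"What is {character}'s {kind} in scenario {sc.index}?",
                    "answer": sc.states[kind],
                })
    # Template 2: ask whether a state changed between consecutive scenarios.
    for prev, curr in zip(scenarios, scenarios[1:]):
        for kind in MENTAL_STATE_KINDS:
            if kind in prev.states and kind in curr.states:
                changed = prev.states[kind] != curr.states[kind]
                questions.append({
                    "type": "change",
                    "question": (
                        f"Does {character}'s {kind} change from scenario "
                        f"{prev.index} to scenario {curr.index}?"
                    ),
                    "answer": "yes" if changed else "no",
                })
    return questions


# Made-up example data for a single character across two scenarios.
example = [
    Scenario(1, {"belief": "the meeting is at noon", "emotion": "calm"}),
    Scenario(2, {"belief": "the meeting was moved to 3 pm", "emotion": "calm"}),
]
questions = generate_questions("Alice", example)
```

Because the answers are derived directly from the annotations by fixed rules, questions of this kind can be scored automatically; the templates actually used in the benchmark may differ.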
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12447