Hi-ToM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Findings
Submission Type: Regular Long Paper
Submission Track: Theme Track: Large Language Models and the Future of NLP
Submission Track 2: Language Modeling and Analysis of Language Models
Keywords: Higher Order Theory of Mind, Chain of Thought Prompting, Large Language Models, Deception
TL;DR: We introduce Hi-ToM, a benchmark for evaluating higher-order ToM reasoning in LLMs, and experimentally evaluate GPT-4, GPT-3.5, Claude, and Guanaco.
Abstract: Theory of Mind (ToM) is the ability to reason about one's own and others' mental states. ToM plays a critical role in the development of intelligence, language understanding, and cognitive processes. While previous work has primarily focused on first- and second-order ToM, we explore higher-order ToM, which involves recursive reasoning about others' beliefs. We introduce Hi-ToM, a Higher-Order Theory of Mind benchmark. Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks, demonstrating the limitations of current LLMs. We conduct a thorough analysis of different failure cases of LLMs and share our thoughts on the implications of our findings for the future of NLP.
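To make the notion of ToM order concrete, here is a minimal sketch (the agent names and question template are hypothetical illustrations, not taken from the benchmark) showing how an nth-order question nests n belief operators around a factual, zeroth-order question:

```python
# Minimal sketch (hypothetical, not from the paper): what "order" means
# in a ToM question. An nth-order question nests n belief operators
# around a zeroth-order factual question about the world.

def tom_question(agents: list[str], obj: str = "the apple") -> str:
    """Build an nth-order ToM question, where n = len(agents)."""
    if not agents:
        return f"Where is {obj}?"          # order 0: plain fact
    q = f"{obj} is"
    for agent in reversed(agents[1:]):     # inner belief operators
        q = f"{agent} thinks {q}"
    return f"Where does {agents[0]} think {q}?"

print(tom_question([]))                        # Where is the apple?
print(tom_question(["Anna"]))                  # order 1
print(tom_question(["Anna", "Bob"]))           # order 2
print(tom_question(["Anna", "Bob", "Carol"]))  # order 3
```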
Submission Number: 2311