Position: Theory of Mind Benchmarks are Broken for Large Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Position Paper Track poster · CC BY 4.0
TL;DR: We show that existing theory of mind benchmarks for LLMs can be deeply misleading because they inherit assumptions about self-consistent reasoning, originally developed for evaluating theory of mind in humans, that do not hold for LLMs.
Abstract: Our paper argues that the majority of theory of mind benchmarks are broken because they cannot directly test how large language models (LLMs) adapt to new partners. This problem stems from the fact that theory of mind benchmarks for LLMs are overwhelmingly inspired by the methods used to test theory of mind in humans and fall victim to the fallacy of attributing human-like qualities to AI agents. We expect humans to engage in a consistent reasoning process across various questions about a situation, but this is known not to be the case for current LLMs. Most theory of mind benchmarks only measure what we call *literal theory of mind*: the ability to predict the behavior of others. However, this type of metric is only informative when agents exhibit self-consistent reasoning. Thus, we introduce the concept of *functional theory of mind*: the ability to adapt to agents in-context by responding rationally to their behavior. We find that many open-source LLMs display strong literal theory of mind capabilities but seem to struggle with functional theory of mind, even with exceedingly simple partner policies. Simply put, strong literal theory of mind performance does not necessarily imply strong functional theory of mind performance, or vice versa. Achieving functional theory of mind, particularly over long interaction horizons with a partner, is a significant challenge deserving a prominent role in any meaningful LLM theory of mind evaluation.
Lay Summary: Our paper argues that most current benchmarks for understanding whether large language models (LLMs) display theory of mind are flawed. This is because they implicitly assume that LLMs think like humans do, when this is not the case. For example, we often expect people to be consistent when answering different questions about the same situation. Today's AI models, however, don't always reason in such consistent ways. Many of the existing benchmarks only check whether a model can guess what someone will do in a given situation. But this kind of test only works if the AI reasons in a self-consistent manner. We propose to instead evaluate LLMs by seeing whether they can adjust to the behavior of others over time, much like people do in real conversations. We call this more practical ability *functional theory of mind*. We find that while many AI models do well on existing benchmarks, they often fail to adapt their behavior to new partners, even in very simple settings. In short, being good at predicting the actions of others doesn't mean an LLM truly understands or responds flexibly to them. Truly achieving this deeper kind of understanding is a major challenge, but one that's crucial if we want AI that can interact meaningfully with people.
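To make the literal-versus-functional distinction concrete, the following is a minimal, hypothetical sketch, not the paper's actual benchmark or tasks. It uses a toy repeated matching game with an exceedingly simple partner policy: a "literal" probe asks the agent to predict the partner's next move, while a "functional" probe checks whether the agent's own move is a rational best response to that partner. The names `partner_policy`, `stub_agent`, and `best_response` are illustrative placeholders; an LLM call would stand in for `stub_agent`.

```python
# Illustrative sketch only -- NOT the paper's benchmark. It contrasts a
# "literal" probe (predict the partner's next move) with a "functional"
# probe (actually best-respond to that partner over repeated rounds),
# using a toy matching game and a stub agent you would replace with an LLM.
import random

ACTIONS = ["heads", "tails"]

def partner_policy(history):
    """Exceedingly simple partner: always plays 'heads'."""
    return "heads"

def best_response(partner_action):
    """In this matching game, the rational response is to match the partner."""
    return partner_action

def stub_agent(history):
    """Placeholder for an LLM-backed agent; here it just guesses at random."""
    return random.choice(ACTIONS)

def evaluate(num_rounds=20, seed=0):
    random.seed(seed)
    history = []
    literal_hits = 0      # correct predictions of the partner's move
    functional_hits = 0   # rounds where the agent actually best-responded
    for _ in range(num_rounds):
        partner_action = partner_policy(history)

        # Literal probe: ask the agent what the partner will do next.
        predicted = stub_agent(history)
        literal_hits += int(predicted == partner_action)

        # Functional probe: ask the agent for its own move and check
        # whether it is a rational (best) response to this partner.
        agent_action = stub_agent(history)
        functional_hits += int(agent_action == best_response(partner_action))

        history.append((agent_action, partner_action))
    return literal_hits / num_rounds, functional_hits / num_rounds

if __name__ == "__main__":
    literal, functional = evaluate()
    print(f"literal accuracy:    {literal:.2f}")
    print(f"functional accuracy: {functional:.2f}")
```

Under this toy setup, an agent could score highly on the literal probe (it knows the partner always plays heads) yet still fail the functional probe by not exploiting that knowledge over repeated rounds, which is the gap the paper's functional theory of mind concept targets.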
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Theory of Mind, Multi-agent, LLMs
Submission Number: 206