Do LLMs selectively encode the goal of an agent's reach?
Keywords: LLM, agents, theory of mind, language, Woodward
TL;DR: All tested models appear to represent text about animate and inanimate actors differently, but only GPT-3.5-turbo and GPT-4 selectively encode an agent's goal and therefore pass our control task, in which animate actors act accidentally.
Abstract: In this work, we investigate whether large language models (LLMs) exhibit one of the earliest Theory of Mind-like behaviors: selectively encoding the goal object of an actor's reach (Woodward, 1998). We prompt state-of-the-art LLMs with ambiguous examples that can be explained by either an object or a location being the goal of an actor's reach, and evaluate each model's bias. We compare the magnitude of the bias in three situations: i) an agent is acting purposefully, ii) an inanimate object is acted upon, and iii) an agent is acting accidentally. We find that two models show a selective bias for agents acting purposefully, but are biased differently from humans. Additionally, the encoding is not robust to semantically equivalent prompt variations. We discuss how this bias compares to the bias infants show and offer a cautionary tale about evaluating machine Theory of Mind (ToM). We release our dataset and code.
Supplementary Material: pdf
Submission Number: 44