Do LLMs selectively encode the goal of an agent's reach?

Laura Ruis; Arduin Findeis; Herbie Bradley; Hossein A. Rahmani; Kyoung Whan Choe; Edward Grefenstette; Tim Rocktäschel

Published: 20 Jun 2023, Last Modified: 29 Jun 2023, ToM 2023

Keywords: LLM, agents, theory of mind, language, Woodward

TL;DR: All tested models appear to represent text with animate and inanimate actors differently, but only GPT-3.5-turbo and GPT-4 selectively encode an agent's goal so that they do not fail on our control task where animate actors act accidentally.

Abstract: In this work, we investigate whether large language models (LLMs) exhibit one of the earliest Theory of Mind-like behaviors: selectively encoding the goal object of an actor's reach (Woodward, 1998). We prompt state-of-the-art LLMs with ambiguous examples that can be explained either by an object or by a location being the goal of an actor's reach, and evaluate each model's bias. We compare the magnitude of the bias in three situations: i) an agent is acting purposefully, ii) an inanimate object is acted upon, and iii) an agent is acting accidentally. We find that two models show a selective bias for agents acting purposefully, but are biased differently than humans. Additionally, the encoding is not robust to semantically equivalent prompt variations. We discuss how this bias compares to the bias infants show and provide a cautionary tale about evaluating machine Theory of Mind (ToM). We release our dataset and code.

Supplementary Material: pdf

Submission Number: 44
