Keywords: theory of mind, mentalization
TL;DR: We analyze the opponent-belief update rules that models learn from observations and find consistent failure modes
Abstract: We study whether neural models learn generalizable belief-updating rules in a competitive Theory of Mind (ToM) task. Using the Standoff competitive-feeding environment, we compare a deterministic, modular ToM baseline against end-to-end transformer models. While the hardcoded baseline produces interpretable, rule-based belief updates, the neural models learn approximations that overfit, exhibiting systematic errors on unseen opponent knowledge states. Through qualitative analysis of the learned belief-update rules, we identify failure modes including violations of object symmetry and temporal invariance, as well as egocentric bias.
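For concreteness, the following is a minimal sketch of what a deterministic, rule-based opponent-belief update of the kind described for the modular baseline might look like; the `OpponentBelief` class, its fields, and the visibility rule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's baseline): a deterministic opponent-belief
# update in which the opponent's believed location of an item changes only when
# the item is inside the opponent's field of view. Because the same rule applies
# to every item and every timestep, it respects object symmetry and temporal
# invariance by construction.
from dataclasses import dataclass, field

@dataclass
class OpponentBelief:
    # item id -> location the opponent is believed to think the item occupies
    believed_locations: dict = field(default_factory=dict)

    def update(self, true_locations, visible_to_opponent):
        """Apply one rule-based update step.

        true_locations:       item id -> actual location at this timestep
        visible_to_opponent:  set of item ids the opponent can currently see
        """
        for item, loc in true_locations.items():
            if item in visible_to_opponent:
                # The opponent observes the item, so its belief matches reality.
                self.believed_locations[item] = loc
            # Otherwise the (possibly stale) belief is left unchanged.

# Usage: a bait is moved while the opponent cannot see it,
# so the opponent's belief stays stale.
belief = OpponentBelief()
belief.update({"bait": (2, 3)}, visible_to_opponent={"bait"})
belief.update({"bait": (5, 1)}, visible_to_opponent=set())
assert belief.believed_locations["bait"] == (2, 3)
```

A learned model must approximate this update from data, which is where the reported failure modes (asymmetric treatment of items, drift over time, and egocentric leakage of the observer's own knowledge) can arise.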
Submission Number: 114