Keywords: theory of mind, mentalization
TL;DR: We analyze the opponent-belief update rules that models learn from observations and find consistent failure modes
Abstract: We study whether neural models learn generalizable belief-updating rules in a competitive Theory of Mind (ToM) task. Using the Standoff competitive-feeding environment, we compare a deterministic, modular ToM baseline against end-to-end transformer models. While the hardcoded baseline produces interpretable, rule-based belief updates, the neural models learn approximations that overfit, exhibiting systematic errors on unseen opponent knowledge states. Through qualitative analysis of the learned belief-update rules, we identify failure modes including violations of object symmetry and temporal invariance, as well as egocentric bias.
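For concreteness, the following is a minimal sketch of what a deterministic, rule-based opponent-belief update of the kind described for the modular baseline might look like; the `OpponentBelief` class, its fields, and the visibility rule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's baseline): a deterministic opponent-belief
# update in which the opponent's believed location of an item changes only when
# the item is inside the opponent's field of view. Because the same rule applies
# to every item and every timestep, it respects object symmetry and temporal
# invariance by construction.
from dataclasses import dataclass, field

@dataclass
class OpponentBelief:
    # item id -> location the opponent is believed to think the item occupies
    believed_locations: dict = field(default_factory=dict)

    def update(self, true_locations, visible_to_opponent):
        """Apply one rule-based update step.

        true_locations:       item id -> actual location at this timestep
        visible_to_opponent:  set of item ids the opponent can currently see
        """
        for item, loc in true_locations.items():
            if item in visible_to_opponent:
                # The opponent observes the item, so its belief matches reality.
                self.believed_locations[item] = loc
            # Otherwise the (possibly stale) belief is left unchanged.

# Usage: a bait is moved while the opponent cannot see it,
# so the opponent's belief stays stale.
belief = OpponentBelief()
belief.update({"bait": (2, 3)}, visible_to_opponent={"bait"})
belief.update({"bait": (5, 1)}, visible_to_opponent=set())
assert belief.believed_locations["bait"] == (2, 3)
```

A learned model must approximate this update from data, which is where the reported failure modes (asymmetric treatment of items, drift over time, and egocentric leakage of the observer's own knowledge) can arise.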
Submission Number: 114