Abstract: Theory of Mind (ToM) is the ability to understand others' mental states and is essential for human social interaction. Although recent studies suggest that large language models (LLMs) exhibit human-level ToM capabilities, the underlying mechanisms remain unclear. "Simulation Theory", widely discussed in cognitive science, posits that we infer others' mental states by simulating their cognitive processes. In this work, we propose a framework for investigating whether the ToM mechanism in LLMs is based on Simulation Theory by analyzing their internal representations. Following this framework, we successfully control LLMs' ToM reasoning through modeled perspective-taking and counterfactual interventions. Our results provide initial evidence that state-of-the-art LLMs implement an emergent ToM partially based on Simulation Theory, suggesting parallels between human and artificial social reasoning.
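The abstract mentions counterfactual interventions on internal representations but does not specify the procedure. Below is a minimal sketch of the general kind of intervention alluded to (patching hidden states from a counterfactual input into a forward pass), assuming a HuggingFace-style causal LM. The model choice ("gpt2"), layer index, token position, the false-belief sentences, and all function names are illustrative assumptions, not the authors' actual setup.

```python
# Sketch of a counterfactual hidden-state intervention (activation patching).
# All concrete choices (model, layer, position, prompts) are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper's models are not specified here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def hidden_at_layer(text, layer_idx):
    """Return hidden states after transformer block `layer_idx` for `text`."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # +1 because hidden_states[0] is the embedding output
    return out.hidden_states[layer_idx + 1]  # shape: (1, seq_len, d_model)

def patched_logits(text, layer_idx, pos, replacement):
    """Re-run `text`, overwriting the block output at (layer_idx, pos)
    with `replacement`, and return the final-token logits."""
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        hs = hs.clone()
        hs[:, pos, :] = replacement
        return (hs,) + output[1:] if isinstance(output, tuple) else hs

    block = model.transformer.h[layer_idx]  # GPT-2 layout; differs per model
    handle = block.register_forward_hook(hook)
    try:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            logits = model(**ids).logits[0, -1]
    finally:
        handle.remove()
    return logits

# Hypothetical usage: inject activations from a counterfactual ("true-belief")
# continuation into a false-belief prompt and inspect the model's prediction.
layer, pos = 6, -1
counterfactual = hidden_at_layer("Sally saw the ball moved to the box.", layer)[:, -1, :]
logits = patched_logits("Sally thinks the ball is in the", layer, pos, counterfactual)
print(tok.decode(logits.argmax().item()))
```

If the prediction shifts toward the counterfactual world state after patching, that is the sort of evidence one would take as consistent with a simulation-like mechanism; the actual criteria used in the paper are not described in this abstract.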
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: interpretability, cognitive modeling, computational psycholinguistics, probing, counterfactual/contrastive explanations
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3484