Keywords: theory of mind, reasoning, reinforcement finetuning, large language model
Abstract: Theory of Mind (ToM) is an essential skill for modern foundation model systems to operate effectively and safely in the real world. Recent works have explored honing ToM via post-training; however, we show that such progress is confounded by a pervasive “shortcut” issue: tasks can reach up to 99% accuracy through simple exploitation of spurious, non-causal correlations, leading to a false sense of ToM. Motivated by this, we first develop a framework to systematically examine ToM datasets for shortcuts and provide guidance for future development. We find that questions reducible to pure state tracking (e.g., “belief”) are especially shortcut-prone compared to mind questions (e.g., “intention”), where reasoning beyond tracking is required. Using four shortcut-free datasets across three ToM contexts, we then comprehensively study whether reinforcement-learning fine-tuning with verifiable rewards and explicit reasoning (Thinking-RFT) elevates ToM beyond supervised fine-tuning (SFT). Our key findings are: 1) Thinking-RFT effectively improves ToM in all scenarios (+6% vs. SFT), particularly in complex higher-order reasoning (+10% vs. SFT) and multimodal cases (+7% vs. SFT), and generalizes notably better to unseen domains and higher-order queries while being more robust to counterfactuals. 2) ToM benefits specifically from the joint effect of reasoning and RL: Thinking-RFT outperforms No-Thinking-RFT by 7% on average. 3) RFT works by learning to ground its reasoning in anchor cues (keywords/state changes) that correspond to causal factors. We believe our study is useful for developing effective and robust ToM post-training datasets and advancing critical ToM capabilities in foundation models.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1862