The Inner Monologue of Language Models: When Reasoning Traces Reveal More Than They Hide

ACL ARR 2026 January Submission6128 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Human-Centric NLP, Ethics Bias and Fairness of LLMs, Generalizability and Transfer
Abstract: Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: are these models aware of what they "learn" and "think"? To address this, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy. Furthermore, we contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that RL-trained models demonstrate greater awareness of their learned behaviors and stronger generalization to novel, structurally similar tasks than SFT models, yet they often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.
Paper Type: Long
Research Area: Human-AI Interaction/Cooperation and Human-Centric NLP
Research Area Keywords: Human-Centric NLP, Ethics Bias and Fairness of LLMs, Generalizability and Transfer
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6128