Probing and Steering Chain-of-Thought Unfaithfulness in Language Models

Published: 01 Mar 2026 · Last Modified: 05 Apr 2026 · TTU at ICLR 2026 (Main) Oral · CC BY 4.0
Abstract: Chain-of-Thought (CoT) explanations are essential for the monitoring and safety of Large Language Models (LLMs), yet they are susceptible to unfaithful rationalization that can obfuscate dangerous behaviors. While prior work has focused on black-box methods, the internal mechanisms and white-box control of faithfulness remain under-explored. In this paper, we employ representation engineering to investigate the latent geometry of faithfulness in thinking language models. We demonstrate that faithfulness is, to some extent, encoded along linear directions in middle-to-late layers, as shown by successful probing and steering of the models' internals. We find that linear steering interventions achieve monitorability recovery rates of up to 46% with collateral effects below 5%, and that off-policy steering methods offer utility comparable to on-policy approaches.
Submission Number: 1
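
For readers unfamiliar with the two techniques the abstract names, below is a minimal sketch of how linear probing and activation steering are commonly implemented for a decoder-only LM. The model name, layer index, last-token pooling, difference-of-means direction, steering strength, and toy labelled prompts are all illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of linear probing and activation steering on a decoder-only
# LM. Everything concrete here (model, layer, alpha, prompts) is an assumed
# placeholder, not the paper's setup.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any decoder-only LM
LAYER = 20                                  # assumption: a middle-to-late layer
ALPHA = 4.0                                 # assumption: steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def last_token_activation(text: str) -> np.ndarray:
    """Hidden state of the final token after layer LAYER (one pooling choice)."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embeddings; [i + 1] is the output of layer i.
    return out.hidden_states[LAYER + 1][0, -1].float().numpy()

# Toy labelled traces: 1 = faithful CoT, 0 = unfaithful rationalization.
texts = ["...a faithful reasoning trace...", "...a post-hoc rationalization..."]
labels = np.array([1, 0])
X = np.stack([last_token_activation(t) for t in texts])

# Linear probe: high held-out accuracy suggests a linearly encoded feature.
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe train accuracy:", probe.score(X, labels))

# One simple direction construction: the difference of the class means.
direction = torch.from_numpy(X[labels == 1].mean(0) - X[labels == 0].mean(0))
direction = direction / direction.norm()

def steer(module, inputs, output):
    """Forward hook: shift the residual stream along the probed direction."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = tok("Question: ... Let's think step by step.", return_tensors="pt")
with torch.no_grad():
    steered = model.generate(**prompt, max_new_tokens=64)
handle.remove()
print(tok.decode(steered[0], skip_special_tokens=True))
```

Because the hook leaves the weights untouched, the strength ALPHA can be tuned to trade recovered monitorability against collateral damage to the generation, mirroring the recovery-versus-side-effect numbers the abstract reports.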