Probing and Steering Chain-of-Thought Unfaithfulness in Language Models

Published: 02 Mar 2026, Last Modified: 06 Mar 2026
Venue: ICLR 2026 Re-Align Workshop
License: CC BY 4.0
Track: tiny / short paper (up to 5 pages)
Domain: machine learning
Abstract: Chain-of-Thought (CoT) explanations are essential for the monitoring and safety of Large Language Models (LLMs), yet they are susceptible to unfaithful rationalization that could obfuscate dangerous behaviors. While prior work has focused on black-box methods, the internal mechanisms and white-box control of faithfulness remain under-explored. In this paper, we employ representation engineering to investigate the latent geometry of faithfulness in thinking language models. We demonstrate that instances of faithfulness are, to some extent, encoded as linear directions in middle-to-late layers, as shown by successful probing and steering of the models' internals. We find that linear steering interventions achieve monitorability recovery rates of up to 46% with collateral effects below 5%. We also find that off-policy steering methods have comparable utility to on-policy approaches.
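To make the probing-and-steering recipe concrete, below is a minimal sketch of the two steps the abstract describes: fitting a linear probe on layer activations to find a faithfulness direction, then adding that direction back into the residual stream as a steering intervention. This is not the paper's implementation: the synthetic activations, the layer index, the steering scale `alpha`, and the choice of probe weights as the steering vector are all illustrative assumptions.

```python
# A minimal sketch of linear probing and activation steering, assuming
# hidden states have already been extracted from a middle-to-late layer.
# All data here is synthetic; the layer choice, the scale `alpha`, and
# the faithful/unfaithful labels are illustrative placeholders, not the
# paper's setup.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 256

# Synthetic stand-ins for layer activations on faithful vs. unfaithful CoT:
# two Gaussian clusters separated along one hidden "faithfulness" direction.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
faithful = rng.normal(size=(500, d_model)) + 1.5 * true_dir
unfaithful = rng.normal(size=(500, d_model)) - 1.5 * true_dir
X = np.vstack([faithful, unfaithful])
y = np.array([1] * 500 + [0] * 500)

# Step 1 (probing): a linear classifier on activations recovers the direction.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy: {probe.score(X, y):.3f}")

# Step 2 (steering): use the unit-normalized probe weights as a steering
# vector (one common choice; a difference-of-means between the two classes
# is another).
steer = torch.tensor(probe.coef_[0], dtype=torch.float32)
steer = steer / steer.norm()

def steering_hook(module, inputs, output, alpha=4.0):
    # Add the faithfulness direction to this layer's output activations;
    # broadcasting applies it at every token position.
    return output + alpha * steer

# With a real model, the hook would be registered on the chosen layer, e.g.:
# handle = model.transformer.h[20].register_forward_hook(steering_hook)
# ... generate with steering active ...
# handle.remove()
```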
Presenter: ~Giovanni_Maria_Occhipinti1
Submission Number: 9