Probing and Steering Chain-of-Thought Unfaithfulness in Language Models
Track: tiny / short paper (up to 5 pages)
Domain: machine learning
Abstract: Chain-of-Thought (CoT) explanations are essential for the monitoring and safety
of Large Language Models (LLMs), yet they are susceptible to unfaithful rationalization that could obfuscate dangerous behaviors. While prior work has focused on
black-box methods, the internal mechanisms and white-box control of faithfulness
remain under-explored. In this paper, we employ representation engineering to
investigate the latent geometry of faithfulness in thinking language models. We
demonstrate that instances of faithfulness are, to some extent, encoded as linear
directions in middle-to-late layers, as shown by successful probing and steering of
the models’ internals. We find that linear steering interventions achieve monitorability recovery rates of up to 46% with collateral effects below 5%. We also find that
off-policy steering methods have comparable utility to on-policy approaches.
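The probing-and-steering approach described in the abstract can be illustrated with a minimal sketch. The code below is a toy illustration, not the paper's method: it uses synthetic activations separated along a hidden linear direction, fits a simple difference-of-means linear probe, and steers an activation by adding the probe direction. All names, dimensions, and the steering coefficient are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

# Synthetic mid-layer activations: "faithful" vs "unfaithful" examples,
# separated along a hidden linear direction (an assumption for illustration).
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
faithful = rng.normal(size=(200, d)) + 2.0 * true_dir
unfaithful = rng.normal(size=(200, d)) - 2.0 * true_dir

# Difference-of-means linear probe: one direction encoding the concept.
probe = faithful.mean(axis=0) - unfaithful.mean(axis=0)
probe /= np.linalg.norm(probe)

def probe_score(h):
    """Project an activation onto the probe direction."""
    return h @ probe

def steer(h, alpha=4.0):
    """Add the probe direction to push an activation toward 'faithful'."""
    return h + alpha * probe

h = unfaithful[0]
print(probe_score(steer(h)) > probe_score(h))  # prints True
```

In practice such interventions would be applied to real residual-stream activations at middle-to-late layers during generation; this sketch only shows the linear-geometry intuition.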
Presenter: ~Giovanni_Maria_Occhipinti1
Submission Number: 9