Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

Published: 30 Sept 2025, Last Modified: 02 Oct 2025
Venue: Mech Interp Workshop (NeurIPS 2025) Poster
License: CC BY 4.0
Keywords: Chain of Thought/Reasoning models, AI Safety
TL;DR: LLMs can be trained to produce more monitorable CoTs.
Abstract: Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest such as harmfulness, bias, or other properties. In this paper, we use information-theoretic analysis to show that a task requiring CoT is a necessary, but not sufficient, condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: the _information gap_, which intuitively measures the extent to which the monitor can extract the information available in the CoT, and the _elicitation error_, which measures how well the monitor approximates the optimal monitoring function. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives. To this end, we propose two complementary approaches: (a) an oracle-based method that directly rewards the monitored model for producing CoTs that maximize monitor accuracy, and (b) a more practical, label-free approach that maximizes the conditional mutual information between outputs and CoTs. In a coding honeypot environment, we show that both methods significantly improve monitor accuracy while preventing CoT degeneration even when training against a monitor, thereby mitigating reward hacking even when the task reward is imperfectly specified.
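To make the two training signals concrete, here is a minimal sketch (not the authors' code) of how each reward could be computed. All names (`oracle_reward`, `cmi_reward`, `log_prob`) are hypothetical, and the label-free term uses a simple plug-in estimate of the conditional mutual information I(Y; CoT | X) via the log-likelihood gain the CoT provides for the output under a scoring model.

```python
from typing import Callable, Optional


def oracle_reward(monitor_pred: float, true_label: int) -> float:
    """(a) Oracle-based reward: reward the monitored model when the monitor's
    prediction on its CoT matches the ground-truth attribute label."""
    # monitor_pred: monitor's probability that the attribute (e.g. reward hacking) is present
    predicted = 1 if monitor_pred >= 0.5 else 0
    return 1.0 if predicted == true_label else 0.0


def cmi_reward(log_prob: Callable[[str, str, Optional[str]], float],
               prompt: str, cot: str, output: str) -> float:
    """(b) Label-free reward: estimate how much information the CoT carries
    about the final output, given the prompt.

    log_prob(prompt, output, cot) returns log q(output | prompt, cot) under a
    scoring model; passing cot=None gives log q(output | prompt)."""
    with_cot = log_prob(prompt, output, cot)
    without_cot = log_prob(prompt, output, None)
    # Positive when the CoT makes the output more predictable, i.e. when the
    # CoT is informative about the output; averaging over samples gives a
    # plug-in estimate related to I(Y; CoT | X).
    return with_cot - without_cot
```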
Submission Number: 280