Keywords: Chain of Thought/Reasoning models, AI Safety
TL;DR: LLMs can be trained to produce more monitorable CoTs.
Abstract: Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as harmfulness or bias. In this paper, we develop a rigorous information-theoretic framework to analyze the fundamental conditions that determine CoT monitorability. Our analysis establishes two necessary conditions for successful monitoring: (i) the CoT must encode sufficient information about the target attribute, and (ii) the monitor must be capable of extracting this information. We show that the success of CoT monitoring hinges on the conditional mutual information between outputs and CoTs. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives. To this end, we propose two complementary approaches: (a) an oracle-based method that directly rewards the monitored model for producing CoTs that maximize monitor accuracy, and (b) a more practical, label-free approach that maximizes the conditional mutual information between outputs and CoTs. Both methods significantly improve monitor accuracy while preventing CoT degeneration, even when training against a monitor, thereby mitigating reward hacking under imperfectly specified task rewards.
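For illustration only: writing X for the prompt, C for the chain of thought, and Y for the output (this notation is an assumption, not taken from the paper), the conditional mutual information the abstract refers to can be expanded using the standard identity

\[ I(Y; C \mid X) = H(Y \mid X) - H(Y \mid X, C), \]

i.e., how much observing the CoT reduces uncertainty about the output beyond what the prompt alone provides. Under this reading, the label-free objective in (b) would encourage CoTs that carry predictive information about the output, which is what a monitor needs to extract.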
Submission Number: 280