Keywords: monitorability, reasoning, faithfulness
TL;DR: A Principled Approach to Chain-of-Thought Monitorability in Reasoning Models
Abstract: Chain-of-thought (CoT) reasoning lets a model expose its intermediate steps, which can make
its behavior easier to interpret, check, and control. In practice, however, CoT traces
are often not truly monitorable: they can be unfaithful, hiding the real reasons behind
an answer, or so verbose that monitoring becomes impractical. Standard reinforcement-learning-based
fine-tuning struggles to fix this, since models receive feedback
only on their final answers, not on their intermediate reasoning steps. To bypass this problem,
we formulate CoT monitorability as a constrained optimization problem
and propose a new pipeline to solve it. We leverage existing instruct
models as efficient priors that transform current CoT traces into high-quality ones
satisfying the monitorability constraints, ensuring faithfulness and conciseness. We
then use the newly generated dataset to teach the base model to reason properly
through standard supervised fine-tuning. On MMLU-Pro (with hint injection),
GSM8K, and MATH500, our approach improves faithfulness by roughly an
additional 10% and shortens CoTs by up to 60% while keeping accuracy essentially
unchanged. Our results open a path toward more interpretable, transparent,
and controllable CoT reasoning.
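The constrained-optimization view in the abstract can be sketched as follows. The notation here is illustrative and ours, not taken from the paper: $\theta$ is the model's parameters, $y_\theta(x)$ and $c_\theta(x)$ are the answer and CoT trace produced on input $x$, $R$ is the task reward, and $\tau_f$, $\tau_\ell$ are hypothetical faithfulness and length thresholds.

```latex
% Illustrative sketch (our notation): maximize task performance
% subject to the CoT trace being monitorable, i.e. faithful and concise.
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D}}
    \big[ R\big( y_\theta(x) \big) \big]
\quad \text{s.t.} \quad
\mathrm{Faith}\big( c_\theta(x) \big) \ge \tau_f ,
\qquad
\mathrm{Len}\big( c_\theta(x) \big) \le \tau_\ell .
```

Under this reading, the pipeline described in the abstract approximately solves the constrained problem by rewriting traces to satisfy the constraints and then distilling them back into the base model via supervised fine-tuning, rather than by constrained RL.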
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15204