Keywords: monitorability, reasoning, faithfulness
TL;DR: A Principled Approach to Chain-of-Thought Monitorability in Reasoning Models
Abstract: Chain-of-thought (CoT) reasoning lets a model expose its intermediate steps, which can make
its behavior easier to interpret, check, and control. In practice, however, CoT traces
are often not truly monitorable: they can be unfaithful, hiding the real reasons behind
an answer, or so verbose that monitoring becomes impractical. Standard reinforcement-learning-based
fine-tuning struggles to fix this, since models receive feedback
only on their final answers, not on their intermediate reasoning steps. To bypass this problem,
we formulate CoT monitorability as a constrained optimization problem
and propose a new pipeline to solve it. We leverage existing instruct
models as efficient priors that transform current CoT traces into high-quality ones
satisfying the monitorability constraints, ensuring faithfulness and conciseness. We
then use the newly generated dataset to teach the base model to reason properly
through standard supervised fine-tuning. On MMLU-Pro (with hint injection),
GSM8K, and MATH500, our approach improves faithfulness by roughly an
additional 10% and shortens CoTs by up to 60% while keeping accuracy essentially
unchanged. Our results open a path toward more interpretable, transparent,
and controllable CoT reasoning.
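The constrained-optimization view in the abstract can be sketched as follows. The notation here is illustrative and ours, not taken from the paper: $\theta$ is the model's parameters, $y_\theta(x)$ and $c_\theta(x)$ are the answer and CoT trace produced on input $x$, $R$ is the task reward, and $\tau_f$, $\tau_\ell$ are hypothetical faithfulness and length thresholds.

```latex
% Illustrative sketch (our notation): maximize task performance
% subject to the CoT trace being monitorable, i.e. faithful and concise.
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D}}
    \big[ R\big( y_\theta(x) \big) \big]
\quad \text{s.t.} \quad
\mathrm{Faith}\big( c_\theta(x) \big) \ge \tau_f ,
\qquad
\mathrm{Len}\big( c_\theta(x) \big) \le \tau_\ell .
```

Under this reading, the pipeline described in the abstract approximately solves the constrained problem by rewriting traces to satisfy the constraints and then distilling them back into the base model via supervised fine-tuning, rather than by constrained RL.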
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15204