The Monitorability Gap Between Reasoning in Thinking and Reasoning in the Output

Zora Che; David Lindner

Published: 23 May 2026, Last Modified: 23 May 2026. Venue: ICML 2026 AIWILD. License: CC BY 4.0

Keywords: monitorability, chain-of-thought, sabotage, evaluation

TL;DR: Output-mode reasoning and thinking-mode reasoning exhibit systematic feature differences, and downstream monitorability results do not transfer reliably, suggesting that output reasoning is an incomplete proxy for thinking-mode reasoning.

Abstract: Chains-of-thought (CoTs) produced during extended thinking offer a natural surface for monitoring reasoning models, but internal CoT is sometimes not exposed, leaving only external CoT: reasoning elicited in a model's visible output. How well does external CoT serve as a proxy for internal CoT? We study this across six open-source and proprietary reasoning models and find that internal and external CoTs differ systematically along a small set of recurring features (meta-response construction, self-talk register, verification, exploratory backtracking, and signposted structure), with one or two features often sufficient to distinguish the two conditions at above 90% accuracy for some models. The gap is model-family-specific: classifiers trained on non-Claude models transfer poorly to Claude models. It also widens on harder datasets and when reasoning materially changes the final answer. Beyond stylistic and content differences, a downstream application of chain-of-thought monitoring shows that internal CoT is markedly weaker at obfuscation than external CoT on two covert-action tasks. In sandbagging experiments, models aware of monitoring achieve substantially higher covert success rates in external CoT than in internal CoT (e.g., 47% vs. 8% for Sonnet 4), driven by selective suppression of explicit intent verbalization. In hint-following experiments, monitor-aware external CoT is markedly less detectable than internal CoT. Together, these results show that external CoT is not a drop-in substitute for internal CoT when studying covert and sabotage tasks: when models are aware they are being observed, their external reasoning can hide misaligned behavior more effectively than their hidden thinking would.

Track: Regular Paper (9 pages)

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 133
