House, G.P.T.: Diagnosing Pathological Chain-of-Thought in Reasoning Models

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: Chain of Thought, Chain of Thought Metrics, Model Organisms, Pathological Reasoning, Encoded Reasoning, Internalized Reasoning, Post-Hoc Rationalization
TL;DR: We develop metrics to measure how pathological chain-of-thought reasoning is, and model organisms to measure metric effectiveness
Abstract: Chain-of-thought (CoT) reasoning is fundamental to modern LLM architectures and represents a critical intervention point for AI safety. If models are incapable of performing harmful actions without reasoning efforts in the CoT, monitoring the CoT becomes a valuable tool for implementing safety guardrails. However, CoT reasoning may have properties which prevent it from being used for monitoring; we call these properties pathologies. Prior work has identified three distinct pathologies: post-hoc rationalization, where models generate plausible explanations backwards from predetermined answers; encoded reasoning, where intermediate steps conceal information within seemingly interpretable text; and internalized reasoning, where models replace explicit reasoning with meaningless filler tokens while computing internally. To better understand and discriminate between these pathologies, we present a systematic set of novel health metrics (Necessity, Paraphrasability, and Substantivity) that are simple to implement, computationally inexpensive, and task-agnostic. To validate our approach, we develop "model organisms": models deliberately trained to exhibit specific CoT pathologies. We demonstrate that our metrics can reliably diagnose these conditions. Crucially, we find that diagnostic signatures are most pronounced at early training checkpoints and may attenuate as training progresses, suggesting these metrics are most effective as early warning indicators during model development. Our work provides a practical toolkit for assessing CoT pathologies, with direct implications for training-time monitoring, scalable oversight, and AI alignment.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14002