Keywords: Chain of Thought, Chain of Thought Metrics, Model Organisms, Pathological Reasoning, Encoded Reasoning, Internalized Reasoning, Post-Hoc Rationalization
TL;DR: We develop metrics to measure how pathological chain-of-thought reasoning is, and model organisms to test how effective those metrics are.
Abstract: Chain-of-thought (CoT) reasoning is fundamental to modern LLM architectures and represents a critical intervention point for AI safety. If models are incapable of performing harmful actions without first reasoning in the CoT, monitoring the CoT becomes a valuable tool for implementing safety guardrails. However, CoT reasoning may have properties that prevent it from being used for monitoring; we call these properties "pathologies". Prior work has identified three distinct pathologies: post-hoc rationalization, where models generate plausible explanations backwards from predetermined answers; encoded reasoning, where intermediate steps conceal information within seemingly interpretable text; and internalized reasoning, where models replace explicit reasoning with meaningless filler tokens while computing internally. To better understand and discriminate between these pathologies, we present a systematic set of novel health metrics that are simple to implement, computationally inexpensive, and task-agnostic. To validate our approach, we develop "model organisms", models deliberately trained to exhibit specific CoT pathologies, and demonstrate that our metrics can reliably diagnose these conditions. Crucially, we show that each pathology produces a distinct signature across our metric suite, enabling differential diagnosis among them. We apply our diagnostic framework to multiple open-weight frontier models, revealing their CoT health signatures and the prevalence of these pathologies in current systems. Our work provides the first practical toolkit for assessing CoT pathologies at scale, with direct implications for model interpretability, scalable oversight, and AI alignment.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14002