Title: Reasoning Models Sometimes Output Illegible Chains of Thought
TL;DR: We find that the reasoning traces of RL-trained models often contain illegible segments, potentially compromising chain-of-thought monitoring for detecting malicious behavior.
Abstract: Language models trained via outcome-based reinforcement learning (RL) to reason with chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior; however, such monitoring is only effective if CoTs are legible and faithful. We evaluate the legibility of CoTs in state-of-the-art reasoning models. We find that R1, R1-Zero, and QwQ often produce illegible CoTs (mixing nonsensical phrases, random words, and non-English characters) before returning to perfect coherence in their final responses, whereas Claude models are notably more legible. Across 14 models, we observe that larger models within the same training paradigm tend to produce more illegible reasoning. Prefill experiments show that truncating reasoning at a legibility threshold reduces accuracy by 53%, suggesting that illegible portions contribute to performance despite being difficult to monitor. Illegibility increases with question difficulty, suggesting that CoT monitoring may be least reliable precisely when it is most needed. We discuss potential explanations for these results, including steganography, vestigial tokens, and training artifacts. Our findings suggest that current approaches to CoT monitoring may be undermined by outcome-based RL, particularly as models face increasingly complex tasks.
Supplementary Material: zip
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 28786