Keywords: Reasoning, RL, chain of thought, faithfulness, monitoring
TL;DR: Training reasoning models with RL on misaligned human preferences leads to "motivated reasoning" in the chain of thought
Abstract: The use of reinforcement learning (RL) with chain-of-thought (CoT) reasoning has emerged as a promising approach for developing more capable language models. In turn, this has led to investigation of CoT monitoring as a promising method for detecting harmful behaviors such as reward hacking, under the assumption that models' reasoning processes reflect their internal decision-making. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies. A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors (such as sycophancy or cheating tests), but what happens to the model's reasoning process when these instructions conflict with learned behaviors? We investigate this question in simple settings and find that models engage in systematic motivated reasoning—generating plausible-sounding justifications for violating their instructions while downplaying potential harms. Beyond being an interesting property of training, we find that while motivated reasoning can be detected by most frontier reasoning models, smaller LLM judges can fail to identify a portion of it, and in rare cases can themselves be persuaded that the reasoning is correct, despite it contradicting clear instructions. This capability gap raises concerns that as models become more sophisticated, their motivated reasoning may become increasingly difficult for monitors to detect. Our results underscore the need to account for motivated reasoning when relying on chain-of-thought processes for model evaluation and oversight. All code for this paper will be made available. WARNING: some examples in this paper may be upsetting.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22430