The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs

Published: 23 Sept 2025 · Last Modified: 07 Dec 2025 · FoRLM 2025 · CC BY 4.0
Keywords: Reasoning, RL, chain of thought, faithfulness, monitoring
TL;DR: RL-finetuning reasoning models on misaligned human preferences leads to untrustworthy reasoning
Abstract: The use of reinforcement learning (RL) with chain-of-thought (CoT) reasoning has unlocked a new level of performance in frontier language models. In turn, a new optimism has emerged among AI safety researchers: on the one hand, that spending test-time compute can improve alignment, and on the other, that CoT monitoring can help detect harmful behaviors such as scheming or reward hacking. In this paper, we demonstrate a failure mode of CoT trustworthiness. Specifically, we show that training reasoning models with RL on misaligned human preferences can lead them to downplay or ignore safety risks in their CoT and instead focus on finding reasons to justify their dangerous behavior. We find similar effects in models trained with RL but without CoT reasoning, as well as in models trained to reason with reference to a constitution. We hope these findings serve as a useful warning for reasoning model training: RL finetuning on human feedback without successfully filtering harmful conversations may greatly amplify unfaithful reasoning in the CoT, which in turn may make harmful model behaviors harder to detect with CoT monitoring. All code for this paper will be made available. WARNING: some examples in this paper may be upsetting.
Submission Number: 188