Keywords: Reinforcement Learning, LLM, Rewards
TL;DR: We present a study of the learning dynamics of policy optimization for LLMs, revealing how policies evolve when they are challenged with reward components of escalating complexity.
Abstract: Policy optimization (PO) algorithms are used to refine Large Language Models (LLMs) for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains under-studied. We investigate this gap with an extensive suite of controlled experiments and identify a powerful principle: \textit{policy optimization consistently follows the path of least resistance}. When afforded the flexibility to interleave reasoning and response, policy optimization reliably learns to discard explicit reasoning, causing the policy to degenerate to a direct \texttt{<answer>}-only format. This outcome holds across a rigorous evaluation suite spanning 5 model families (4B-24B), 3 reasoning domains (math, code, logic), and 3 distinct PO algorithms (GRPO, DAPO, REINFORCE++). This format collapse persists even when the more complex \texttt{<think><answer>} format is assigned reward weights up to 8x larger. We formalize this principle through a series of controlled reward decomposition experiments, demonstrating a clear hierarchy: PO systematically optimizes for the simplest reward component first, a preference that holds even when faced with mutually exclusive choices or strong incentives for more complex behaviors. Finally, we show that convergence on the high-reward shortcut is not a low-effort drift but is driven by an optimization process that requires the KL-regularized policy to have sufficient freedom to shift significantly away from its initial prior. Our findings reveal that granting policies the freedom to diverge is a double-edged sword: while necessary for discovering high-reward shortcuts, it also creates a powerful incentive to game the simplest aspects of the reward function, posing a critical reward-hacking challenge for alignment.
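For readers less familiar with the setup, the following is a minimal sketch of the kind of KL-regularized objective and decomposed reward the abstract refers to; the symbols $w_{\mathrm{fmt}}$, $w_{\mathrm{ans}}$, $r_{\mathrm{format}}$, and $r_{\mathrm{answer}}$ are illustrative assumptions, not the paper's exact notation:
% Hedged sketch: standard KL-regularized policy-optimization objective
% (as used by GRPO-style methods), with an assumed decomposed reward.
\begin{align}
  J(\theta) &= \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[ r(x, y) \big]
             \;-\; \beta\, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot\mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \big), \\
  r(x, y)   &= w_{\mathrm{fmt}}\, r_{\mathrm{format}}(y) \;+\; w_{\mathrm{ans}}\, r_{\mathrm{answer}}(x, y),
\end{align}
% where r_format rewards adherence to the <think><answer> template, r_answer rewards
% final-answer correctness, and "up to 8x larger reward weights" would correspond to scaling w_fmt.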
Primary Area: reinforcement learning
Submission Number: 7953