Keywords: Reinforcement Learning, LLM, Rewards
TL;DR: We present a study of the learning dynamics of policy optimization for LLMs, revealing how policies evolve when they are challenged with reward components of escalating complexity.
Abstract: Policy optimization (PO) algorithms are used to refine Large Language Models (LLMs) for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains under-studied. We investigate this gap with an extensive suite of controlled experiments and identify a powerful principle: \textit{policy optimization consistently follows the path of least resistance}. When afforded the flexibility to interleave reasoning and response, policy optimization reliably learns to discard explicit reasoning, causing the policy to degenerate to a direct \texttt{<answer>}-only format. This outcome holds across a rigorous evaluation suite spanning 5 model families (4B-24B), 3 reasoning domains (math, code, logic), and 3 distinct PO algorithms (GRPO, DAPO, REINFORCE++). This format collapse persists even when the more complex \texttt{<think><answer>} format is assigned reward weights up to 8x larger. We formalize this principle through a series of controlled reward decomposition experiments, demonstrating a clear hierarchy: PO systematically optimizes for the simplest reward component first, a preference that holds even when faced with mutually exclusive choices or strong incentives for more complex behaviors. Finally, we show that convergence on the high-reward shortcut is not a low-effort drift but is driven by an optimization process that requires the KL-regularized policy to have sufficient freedom to shift significantly away from its initial prior. Our findings reveal that granting policies the freedom to diverge is a double-edged sword: while necessary for discovering high-reward shortcuts, it also creates a powerful incentive to game the simplest aspects of the reward function, posing a critical reward-hacking challenge for alignment.
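For readers less familiar with the setup, the following is a minimal sketch of the kind of KL-regularized objective and decomposed reward the abstract refers to; the symbols $w_{\mathrm{fmt}}$, $w_{\mathrm{ans}}$, $r_{\mathrm{format}}$, and $r_{\mathrm{answer}}$ are illustrative assumptions, not the paper's exact notation:
% Hedged sketch: standard KL-regularized policy-optimization objective
% (as used by GRPO-style methods), with an assumed decomposed reward.
\begin{align}
  J(\theta) &= \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[ r(x, y) \big]
             \;-\; \beta\, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot\mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \big), \\
  r(x, y)   &= w_{\mathrm{fmt}}\, r_{\mathrm{format}}(y) \;+\; w_{\mathrm{ans}}\, r_{\mathrm{answer}}(x, y),
\end{align}
% where r_format rewards adherence to the <think><answer> template, r_answer rewards
% final-answer correctness, and "up to 8x larger reward weights" would correspond to scaling w_fmt.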
Primary Area: reinforcement learning
Submission Number: 7953