R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning

ACL ARR 2026 January Submission 7631 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Reinforcement Learning, Reasoning, Policy Optimization, Exploration-Exploitation
Abstract: Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between generating stable inference responses and diverse training trajectories leads to insufficient exploration, which harms reasoning capability. To address this problem, we propose $R^2$PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout-Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that $R^2$PO consistently outperforms baselines, achieving average accuracy gains of 3.1\% on MATH-500 and 2.4\% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at https://github.com/RRPO-ARR/Code.
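The abstract's core idea is a small head that perturbs trajectory sampling during training while leaving inference generation untouched. Below is a minimal, hypothetical sketch of what such a "Residual Rollout-Head" could look like; the class name, MLP structure, `scale` hyperparameter, and the switch on `training_rollout` are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a residual rollout head: a small module whose output is
# added to the base policy's logits only when sampling training trajectories,
# so inference responses still come from the unmodified policy.
import torch
import torch.nn as nn


class ResidualRolloutHead(nn.Module):
    """Small MLP producing a residual over vocabulary logits (assumed design)."""

    def __init__(self, hidden_size: int, vocab_size: int, scale: float = 0.1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, vocab_size),
        )
        # Assumed scaling factor controlling how far training rollouts may
        # deviate from the base policy's distribution.
        self.scale = scale

    def forward(
        self,
        hidden_states: torch.Tensor,
        base_logits: torch.Tensor,
        training_rollout: bool,
    ) -> torch.Tensor:
        # During training rollouts, add a scaled residual to diversify trajectories;
        # at inference time, return the base policy logits unchanged.
        if training_rollout:
            return base_logits + self.scale * self.proj(hidden_states)
        return base_logits
```

In this sketch, trajectory diversification is confined to the rollout path, which is one plausible way to realize the decoupling the abstract describes.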
Paper Type: Long
Research Area: Language Models
Research Area Keywords: fine-tuning, safety and alignment, chain-of-thought, robustness
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7631