Keywords: LLM, RL
Abstract: To our knowledge, all existing reinforcement fine-tuning algorithms for large language models require generating a complete reasoning process starting from the question, which incurs substantial time overhead during the rollout phase of training. Challenging this convention, we hypothesize that during reinforcement fine-tuning the model only needs to generate part of the reasoning process. We analyze how different segments of the reasoning path affect the correctness of the final result, and based on these insights we introduce \textbf{Policy Optimization with Experience Replay (POER)}, a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, POER trains the model by generating only suffixes of the reasoning path on top of an experience cache, significantly reducing training time while improving training stability. Measured during the rollout phase of training, POER cuts token generation in this phase by approximately 95\%, greatly lowering the theoretical time overhead. In practice, compared with full-path reinforcement fine-tuning algorithms, POER reduces training time by 90\% for a 1.5B model and by 72\% for a 7B model, while maintaining performance comparable to typical algorithms such as GRPO and DAPO.
We have open-sourced the code in an anonymous repository: \url{https://anonymous.4open.science/r/POER-4BF2}
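For intuition, below is a minimal sketch of the suffix-only rollout idea described in the abstract, assuming a simple in-memory experience cache keyed by question. The names (ExperienceCache, generate_tokens, rollout) and the keep_ratio parameter are hypothetical illustrations of the general technique, not the released POER implementation; prefix selection, reward computation, and the policy update are omitted (see the repository above for the actual code).

```python
# Sketch: suffix-only rollout with an experience cache (illustrative, not POER's code).
import random
from dataclasses import dataclass, field


@dataclass
class ExperienceCache:
    """Caches previously generated reasoning traces, keyed by question."""
    traces: dict = field(default_factory=dict)  # question -> list of token ids

    def put(self, question: str, trace: list[int]) -> None:
        self.traces[question] = trace

    def sample_prefix(self, question: str, keep_ratio: float) -> list[int] | None:
        """Return the first `keep_ratio` fraction of a cached trace, if any."""
        trace = self.traces.get(question)
        if trace is None:
            return None
        cut = max(1, int(len(trace) * keep_ratio))
        return trace[:cut]


def generate_tokens(prompt_tokens: list[int], max_new: int) -> list[int]:
    """Stand-in for the policy model's decoding loop (random tokens here)."""
    return [random.randint(0, 31_999) for _ in range(max_new)]


def rollout(question: str, question_tokens: list[int],
            cache: ExperienceCache, keep_ratio: float = 0.95) -> list[int]:
    """Full-path rollout on a cache miss; suffix-only rollout on a cache hit."""
    prefix = cache.sample_prefix(question, keep_ratio)
    if prefix is None:
        # Cold start: generate the complete reasoning path once and cache it.
        trace = generate_tokens(question_tokens, max_new=1024)
        cache.put(question, trace)
        return trace
    # Warm start: reuse the cached prefix and generate only the remaining suffix,
    # so far fewer tokens are decoded during rollout.
    suffix = generate_tokens(question_tokens + prefix, max_new=64)
    return prefix + suffix


if __name__ == "__main__":
    cache = ExperienceCache()
    q = "What is 17 * 24?"
    toks = list(range(8))             # pretend-tokenized question
    first = rollout(q, toks, cache)   # full path, populates the cache
    second = rollout(q, toks, cache)  # only a short suffix is generated
    print(len(first), len(second))
```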
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 12091