Abstract: Reinforcement learning has become a cornerstone technique for developing reasoning models on complex tasks, ranging from mathematical problem-solving to visual reasoning. However, prevailing methods typically employ static advantage estimation, neglecting how the utility of training samples evolves over time. This limitation often results in slower convergence and greater learning instability, as models fail to adapt effectively to changing sample utilities. To address this problem, we introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a simple yet effective reinforcement learning technique that dynamically differentiates training data into temporarily advantageous and disadvantageous samples through model rollouts, guided by a tailored data differentiation strategy. Instead of optimizing against static advantage estimates, ADORA adjusts advantage signals on the fly, enabling more efficient policy updates. Extensive evaluations demonstrate that ADORA significantly enhances long chain-of-thought reasoning on both mathematical and geometric tasks, yielding notable performance gains for large language models and vision–language models alike.
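The abstract does not specify ADORA's data differentiation strategy, so the following is only a minimal illustrative sketch of the general idea of "adjusting advantage signals on the fly": a GRPO-style group-relative baseline whose advantages are reweighted by a per-sample utility signal. The utility_scores input, the tau threshold, and the down-weighting factor are all assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def dynamic_advantages(rewards, utility_scores, tau=0.5):
    """Group-normalized advantages with dynamic sample reweighting
    (illustrative sketch only; not ADORA's published algorithm).

    rewards        : per-rollout scalar rewards for one prompt group
    utility_scores : current estimate of each sample's training utility
                     (how this is computed is an assumption here)
    tau            : hypothetical threshold separating temporarily
                     advantageous from disadvantageous samples
    """
    rewards = np.asarray(rewards, dtype=float)
    # Standard group-relative baseline, as in GRPO-style estimators.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Dynamic differentiation: up-weight samples currently judged
    # useful and down-weight the rest, rather than keeping the
    # advantage signal static across training.
    mask = np.asarray(utility_scores, dtype=float) >= tau
    weights = np.where(mask, 1.0, 0.1)  # 0.1 is an illustrative choice
    return adv * weights

# Usage: four rollouts for one prompt, utilities from some heuristic.
print(dynamic_advantages([1.0, 0.0, 1.0, 0.0], [0.9, 0.2, 0.6, 0.8]))
```

Because the weights are recomputed at every rollout step, the same sample can shift between the advantageous and disadvantageous groups over training, which is the dynamic behavior the abstract describes.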
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, optimization methods, generalization, continual learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 6275