ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning

13 Sept 2025 (modified: 06 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: reinforcement learning, optimization methods, generalization, continual learning
TL;DR: We introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel RL framework designed to dynamically calibrate advantage estimation.
Abstract: Reinforcement learning has become a cornerstone technique for developing reasoning models on complex tasks, from mathematical problem-solving to visual reasoning. These models are typically optimized with policy gradient methods, whose efficacy hinges on accurate estimation of an advantage function. Prevailing methods, however, rely on static advantage estimation, which neglects how the utility of training samples evolves over time and thus assigns credit inefficiently. The resulting suboptimal policy updates manifest as slower convergence and greater learning instability, as the model fails to adapt to shifting sample utilities. To address this problem, we introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel framework for policy optimization. ADORA dynamically adjusts the weighting of the advantage function by adaptively categorizing training data into temporarily advantageous and disadvantageous samples according to their evolving utility during online model rollouts. This data-differentiation strategy integrates seamlessly into existing policy optimization algorithms without significant architectural modifications, letting the policy prioritize learning from more informative experiences and thereby achieve more efficient policy updates. Extensive evaluations on a range of tasks show that ADORA significantly enhances long-form reasoning on both geometric and mathematical tasks across large vision-language models and large language models, yielding notable performance gains.
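The abstract does not give ADORA's exact formulation, but the core idea it describes (reweighting a group-relative advantage by whether a sample's utility is currently rising or falling across rollouts) can be sketched as follows. This is a hypothetical illustration, not the paper's method: the EMA-based utility signal and the 1.5/0.5 weighting factors are assumptions chosen for clarity.

```python
import numpy as np

def adora_style_advantages(rewards, prev_utility, ema_beta=0.9):
    """Hypothetical sketch of dynamic advantage reweighting.

    rewards: per-rollout rewards for one prompt's group of rollouts.
    prev_utility: running utility estimate for this prompt from
        earlier training steps (here, an EMA of the mean reward --
        an assumed proxy for the paper's "evolving utility").
    Returns (reweighted advantages, updated utility estimate).
    """
    rewards = np.asarray(rewards, dtype=float)

    # Group-relative baseline, as in GRPO-style advantage estimators.
    adv = rewards - rewards.mean()
    std = rewards.std()
    if std > 0:
        adv = adv / std

    # Utility dynamics: update the running estimate and check whether
    # this prompt's rollouts are currently improving.
    utility = ema_beta * prev_utility + (1.0 - ema_beta) * rewards.mean()
    improving = utility > prev_utility

    # Categorize the sample group as temporarily advantageous or
    # disadvantageous and scale the advantage accordingly
    # (the 1.5 / 0.5 factors are illustrative, not from the paper).
    weight = 1.5 if improving else 0.5
    return weight * adv, utility
```

Because the reweighting only rescales the advantage fed to the policy gradient, a scheme like this can wrap an existing estimator without architectural changes, which matches the abstract's claim of drop-in integration.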
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4721