Abstract: Reinforcement learning has become a cornerstone technique for developing reasoning models on complex tasks, ranging from mathematical problem-solving to visual reasoning. However, prevailing methods typically employ static advantage estimation, neglecting how the utility of training samples evolves over time. This limitation often results in slower convergence and greater learning instability, as models fail to adapt effectively to changing sample utilities. To address this problem, we introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a simple yet effective reinforcement learning technique that dynamically differentiates training data into temporarily advantageous and disadvantageous samples through model rollouts, guided by a tailored data differentiation strategy. Instead of optimizing against static advantage estimates, ADORA adjusts advantage signals on the fly, enabling more efficient policy updates. Extensive evaluations demonstrate that ADORA significantly enhances long chain-of-thought reasoning on both mathematical and geometric tasks, yielding notable performance gains for large language models and vision–language models alike.
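The abstract does not specify ADORA's data differentiation strategy, so the following is only a minimal illustrative sketch of the general idea of "adjusting advantage signals on the fly": a GRPO-style group-relative baseline whose advantages are reweighted by a per-sample utility signal. The utility_scores input, the tau threshold, and the down-weighting factor are all assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def dynamic_advantages(rewards, utility_scores, tau=0.5):
    """Group-normalized advantages with dynamic sample reweighting
    (illustrative sketch only; not ADORA's published algorithm).

    rewards        : per-rollout scalar rewards for one prompt group
    utility_scores : current estimate of each sample's training utility
                     (how this is computed is an assumption here)
    tau            : hypothetical threshold separating temporarily
                     advantageous from disadvantageous samples
    """
    rewards = np.asarray(rewards, dtype=float)
    # Standard group-relative baseline, as in GRPO-style estimators.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Dynamic differentiation: up-weight samples currently judged
    # useful and down-weight the rest, rather than keeping the
    # advantage signal static across training.
    mask = np.asarray(utility_scores, dtype=float) >= tau
    weights = np.where(mask, 1.0, 0.1)  # 0.1 is an illustrative choice
    return adv * weights

# Usage: four rollouts for one prompt, utilities from some heuristic.
print(dynamic_advantages([1.0, 0.0, 1.0, 0.0], [0.9, 0.2, 0.6, 0.8]))
```

Because the weights are recomputed at every rollout step, the same sample can shift between the advantageous and disadvantageous groups over training, which is the dynamic behavior the abstract describes.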
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, optimization methods, generalization, continual learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 6275