Abstract: Reinforcement learning (RL) has emerged as a promising strategy for improving the reasoning capabilities of language models (LMs) in domains such as mathematics and coding. However, most modern RL algorithms were designed for robotics applications, which differ significantly from LM reasoning. We analyze RL algorithm design decisions for LM reasoning in terms of both accuracy and computational efficiency, focusing on relatively small models due to computational constraints. Our findings are: (i) on-policy RL significantly outperforms supervised fine-tuning (SFT), (ii) PPO-based off-policy updates increase accuracy rather than reduce variance, and (iii) removing the KL divergence penalty can lead to more concise generations and higher accuracy. Furthermore, we find that a key bottleneck to computational efficiency is that the optimal batch sizes for inference and backpropagation differ. We propose a novel algorithm, DASH, that performs $\textit{preemptive sampling}$ (i.e., sampling a large batch and accumulating gradient updates in small increments) and $\textit{gradient filtering}$ (i.e., dropping samples with small advantage estimates). We show that DASH reduces training time by 83\% compared to a standard implementation of GRPO without sacrificing accuracy. Our findings provide valuable insights into designing effective RL algorithms for LM reasoning.
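The abstract describes DASH only at a high level. The sketch below is a minimal illustration, not the paper's implementation, of how preemptive sampling and gradient filtering could fit together in one training step; the helpers `sample_fn` and `loss_fn`, the `advantage_threshold`, and the micro-batch size are all assumed for illustration and are not specified in the source.

```python
import torch


def dash_style_update(policy: torch.nn.Module,
                      optimizer: torch.optim.Optimizer,
                      prompts,
                      sample_fn,          # hypothetical: generates rollouts with advantage estimates
                      loss_fn,            # hypothetical: policy-gradient loss over a micro-batch
                      micro_batch_size: int = 8,
                      advantage_threshold: float = 0.05) -> None:
    """One training step combining preemptive sampling and gradient filtering (sketch)."""
    # Preemptive sampling: a single large generation pass at an inference-friendly batch size.
    rollouts = sample_fn(policy, prompts)  # e.g., list of dicts, each with an "advantage" field

    # Gradient filtering: drop samples whose advantage estimates are near zero,
    # since they contribute little signal to the policy gradient.
    kept = [r for r in rollouts if abs(r["advantage"]) >= advantage_threshold]
    if not kept:
        return

    # Gradient accumulation: backpropagate over small micro-batches, then take one optimizer step.
    optimizer.zero_grad()
    num_micro = (len(kept) + micro_batch_size - 1) // micro_batch_size
    for start in range(0, len(kept), micro_batch_size):
        micro = kept[start:start + micro_batch_size]
        # Divide by the number of micro-batches so the accumulated gradient
        # approximates the average over the full sampled batch.
        loss = loss_fn(policy, micro) / num_micro
        loss.backward()
    optimizer.step()
```

The design point this sketch is meant to convey is the decoupling of batch sizes: sampling runs once at a large, inference-friendly batch size, while backpropagation runs over small micro-batches whose gradients are accumulated before a single parameter update.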
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: large language models, efficient training, reinforcement learning, reasoning
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 4636