DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

ACL ARR 2025 May Submission 246 Authors

09 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level, scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose \textit{Diversity-aware Reward Adjustment} (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR.GRPO, resulting in $\textit{DRA-GRPO}$ and $\textit{DRA-DR.GRPO}$. Experiments on five mathematical reasoning benchmarks show that our method outperforms recent strong baselines, achieving state-of-the-art performance with an average accuracy of 58.2\% using only 7,000 fine-tuning samples and a total training cost of around \$55.
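To make the idea concrete, below is a minimal sketch of a diversity-aware reward adjustment plugged into GRPO-style group-relative advantages. The abstract does not specify the exact SMI instantiation, embedding model, or scaling, so the graph-cut-style redundancy surrogate, the exponential weighting, the `temperature` parameter, and all function names here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only. The redundancy measure below is a hypothetical
# stand-in for the Submodular Mutual Information used by DRA; the embedding
# source and weighting scheme are assumptions for demonstration.
import numpy as np


def diversity_weights(embeddings: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Assign a weight in (0, 1] to each completion: lower when a completion
    is semantically redundant with the rest of the sampled group."""
    # Unit-normalize so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, 0.0)
    # Graph-cut-style redundancy: average similarity to the other completions.
    redundancy = sims.sum(axis=1) / max(len(emb) - 1, 1)
    # Redundant completions are downweighted; diverse ones keep weight near 1.
    return np.exp(-redundancy / temperature)


def dra_adjusted_advantages(rewards: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """Rescale scalar rewards by diversity weights, then compute the usual
    GRPO group-relative advantage (reward minus group mean, over group std)."""
    adjusted = rewards * diversity_weights(embeddings)
    return (adjusted - adjusted.mean()) / (adjusted.std() + 1e-8)


if __name__ == "__main__":
    # Toy group of 4 sampled completions: two near-duplicates and two distinct ones.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 16))
    emb[1] = emb[0] + 0.01 * rng.normal(size=16)  # near-duplicate of completion 0
    rewards = np.array([1.0, 1.0, 1.0, 0.0])      # identical scalar rewards for 0-2
    print(dra_adjusted_advantages(rewards, emb))
```

In this toy example, completions 0 and 1 receive the same scalar reward, but their mutual redundancy lowers their adjusted advantage relative to the equally rewarded but more distinct completion 2, which is the exploration-preserving behavior the abstract describes.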
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Diversity-aware Reward, GRPO, LLM for Mathematical reasoning
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 246