Keywords: Diversity-aware Reward, GRPO, LLM for Mathematical Reasoning, Debiasing
Abstract: Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often \textit{non-injective} with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a \textit{Diversity-Quality Inconsistency}, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies.
To bridge this gap, we propose \textbf{D}iversity-aware \textbf{R}eward \textbf{A}djustment (\textbf{DRA}), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups.
By leveraging Submodular Mutual Information (SMI), DRA implements an \textit{Inverse Propensity Scoring (IPS)} mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape.
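To make the mechanism concrete, below is a minimal sketch of a density-based reward correction in the spirit described above. It assumes cosine similarity over sentence embeddings as a stand-in for the SMI-based semantic density and a simple inverse weighting as the IPS-style correction; all names and design choices here are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of a diversity-aware reward adjustment (illustrative only).
# Assumptions not taken from the paper: cosine similarity over embeddings as a
# proxy for SMI-based semantic density, and inverse weighting r_i / d_i as the
# IPS-style correction.
import numpy as np

def adjust_rewards(embeddings: np.ndarray, rewards: np.ndarray,
                   eps: float = 1e-6) -> np.ndarray:
    """Reweight a group's rewards inversely to each sample's semantic density.

    embeddings: (G, D) embeddings of the G sampled completions.
    rewards:    (G,) scalar correctness rewards from GRPO.
    """
    # Normalize rows so dot products become cosine similarities.
    norm = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    sim = norm @ norm.T                    # (G, G) pairwise similarities
    np.fill_diagonal(sim, 0.0)             # exclude self-similarity
    density = np.clip(sim.mean(axis=1), eps, None)  # semantic density in group
    # IPS-style correction: redundant completions are down-weighted,
    # structurally novel reasoning paths keep more of their reward.
    adjusted = rewards / density
    # Renormalize so the group's total reward mass is preserved.
    return adjusted * (rewards.sum() / (adjusted.sum() + eps))
```

The renormalization step is one possible choice for keeping the adjusted rewards on the same scale as the originals before they enter the group-relative advantage computation.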
Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2\% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and \$55 in training cost, highlighting the critical role of diversity calibration in data-efficient alignment.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, optimization methods, alignment, fine-tuning, math QA, reasoning, data-efficient training, chain-of-thought
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 3020