Keywords: Diversity-aware Reward, GRPO, LLM for Mathematical Reasoning, Debiasing
Abstract: Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often \textit{non-injective} with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a \textit{Diversity-Quality Inconsistency}, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies.
To bridge this gap, we propose \textbf{D}iversity-aware \textbf{R}eward \textbf{A}djustment (\textbf{DRA}), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups.
By leveraging Submodular Mutual Information (SMI), DRA implements an \textit{Inverse Propensity Scoring (IPS)} mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape.
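To make the mechanism concrete, below is a minimal sketch of a density-based reward correction in the spirit described above. It assumes cosine similarity over sentence embeddings as a stand-in for the SMI-based semantic density and a simple inverse weighting as the IPS-style correction; all names and design choices here are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of a diversity-aware reward adjustment (illustrative only).
# Assumptions not taken from the paper: cosine similarity over embeddings as a
# proxy for SMI-based semantic density, and inverse weighting r_i / d_i as the
# IPS-style correction.
import numpy as np

def adjust_rewards(embeddings: np.ndarray, rewards: np.ndarray,
                   eps: float = 1e-6) -> np.ndarray:
    """Reweight a group's rewards inversely to each sample's semantic density.

    embeddings: (G, D) embeddings of the G sampled completions.
    rewards:    (G,) scalar correctness rewards from GRPO.
    """
    # Normalize rows so dot products become cosine similarities.
    norm = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    sim = norm @ norm.T                    # (G, G) pairwise similarities
    np.fill_diagonal(sim, 0.0)             # exclude self-similarity
    density = np.clip(sim.mean(axis=1), eps, None)  # semantic density in group
    # IPS-style correction: redundant completions are down-weighted,
    # structurally novel reasoning paths keep more of their reward.
    adjusted = rewards / density
    # Renormalize so the group's total reward mass is preserved.
    return adjusted * (rewards.sum() / (adjusted.sum() + eps))
```

The renormalization step is one possible choice for keeping the adjusted rewards on the same scale as the originals before they enter the group-relative advantage computation.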
Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2\% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and \$55 in training cost, highlighting the critical role of diversity calibration in data-efficient alignment.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, optimization methods, alignment, fine-tuning, math QA, reasoning, data-efficient training, chain-of-thought
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 3020