Keywords: LLM
Abstract: Reinforcement learning (RL) has become a cornerstone for improving the reasoning ability of large language models (LLMs).
The currently mainstream Group Relative Policy Optimization (GRPO) estimates each response's advantage via relative comparisons within the full group of sampled responses.
However, this single-scale, global comparison mechanism is inherently brittle: it is sensitive to the heterogeneity and stochasticity of the reward distribution, which leads to unstable training signals.
Drawing inspiration from graph theory, where node importance is better captured through local substructures than through global statistics, we propose Multi-Scale Group Relative Policy Optimization (MS-GRPO), a novel RL algorithm that generalizes GRPO by aggregating relative advantages computed across multiple response subgroups at varying scales (e.g., pairs, trios, and larger subgroups).
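A minimal sketch of the multi-scale idea is given below, assuming the subgroup-level advantage mirrors GRPO's (reward minus subgroup mean, divided by subgroup standard deviation) and that per-subgroup estimates are simply averaged; the abstract does not specify the exact aggregation rule, so the function name `ms_grpo_advantages`, the scale set, and the averaging scheme are illustrative assumptions only.

```python
# Illustrative sketch: average group-relative advantages over all subgroups
# of a few scales (subgroup sizes). Not the paper's exact formulation.
from itertools import combinations
import numpy as np

def ms_grpo_advantages(rewards, scales=(2, 3), eps=1e-6):
    """Average GRPO-style advantages over all subgroups of the given scales."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    adv_sum = np.zeros(n)
    adv_cnt = np.zeros(n)
    for k in scales:                          # subgroup size ("scale")
        for idx in combinations(range(n), k):
            sub = rewards[list(idx)]
            centered = (sub - sub.mean()) / (sub.std() + eps)
            for j, i in enumerate(idx):       # accumulate per-response estimates
                adv_sum[i] += centered[j]
                adv_cnt[i] += 1
    return adv_sum / np.maximum(adv_cnt, 1)

# Example: a group of 4 sampled responses with binary rewards
print(ms_grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```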
Since exhaustive enumeration of all meaningful subgroups grows combinatorially with group size, we further introduce a practical acceleration scheme that selects a small yet representative subset of subgroups via dilated scale sampling and diversity-aware subgroup selection; a sketch follows below. In addition, we provide a rigorous theoretical analysis showing that MS-GRPO can be interpreted as an adaptive correction of GRPO's advantage, controlled by the heterogeneity of the reward distribution, and that it gracefully reduces to GRPO as the reward distribution approaches homogeneity.
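The sketch below is a hedged interpretation of the acceleration scheme named above: "dilated scale sampling" is read here as taking subgroup sizes at dilated (geometric) strides, and "diversity-aware subgroup selection" as greedily keeping candidate subgroups with the largest reward spread. The helper names `dilated_scales` and `select_diverse_subgroups`, and the spread criterion, are assumptions; the paper's actual criteria may differ.

```python
# Assumed acceleration scheme: dilated subgroup sizes + diversity-ranked subgroup picks.
import random
import numpy as np

def dilated_scales(group_size, dilation=2):
    """Subgroup sizes 2, 4, 8, ... up to the full group (assumed dilation rule)."""
    k, scales = 2, []
    while k <= group_size:
        scales.append(k)
        k *= dilation
    return scales

def select_diverse_subgroups(rewards, scale, num_subgroups, num_candidates=64, seed=0):
    """Sample candidate subgroups, keep those whose rewards are most spread out."""
    rng = random.Random(seed)
    n = len(rewards)
    candidates = {tuple(sorted(rng.sample(range(n), scale))) for _ in range(num_candidates)}
    spread = lambda idx: float(np.std([rewards[i] for i in idx]))
    return sorted(candidates, key=spread, reverse=True)[:num_subgroups]

rewards = [0.9, 0.1, 0.4, 0.8, 0.0, 1.0, 0.3, 0.7]
for s in dilated_scales(len(rewards)):
    print(s, select_diverse_subgroups(rewards, s, num_subgroups=3))
```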
Experiments demonstrate that MS-GRPO significantly outperforms GRPO on a variety of tasks, with improvements averaged over all evaluated models of +5.5 on AIME24 math reasoning, +4.6 on RiddleSense logical reasoning, +2.7 on LiveCodeBench programming challenges, +2.2 on MedQA medical reasoning, and +13.5 on HotpotQA with a search engine.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4604