Keywords: LLM, Post-training, GRPO
Abstract: Group Relative Policy Optimization (GRPO) introduces a new paradigm for reinforcement learning in Large Language Models (LLMs), modifying PPO by eliminating the value model for efficient post-training. However, vanilla GRPO assigns equal weight to all prompts during policy updates, ignoring that supervision whose target answers are inconsistent with the model's existing parametric knowledge can increase hallucinations and degrade downstream performance. To address this limitation, we propose SEED-GRPO (Semantic Entropy EnhanceD GRPO), which explicitly measures LLMs' uncertainty and uses it to modulate the learning process. This enables conservative updates for high-uncertainty prompts (e.g., those beyond the model's knowledge) while preserving stronger learning signals for confident ones. Experimental results on five mathematical reasoning benchmarks (AIME24 56.7, AMC 68.7, MATH 83.4, Minerva 34.2, and OlympiadBench 48.0) and on four few-shot fine-grained image classification datasets demonstrate that SEED-GRPO achieves new state-of-the-art performance in average accuracy. The code and implementation details will be publicly released.
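As a rough illustration of the core idea, the sketch below estimates a prompt's semantic entropy from a group of sampled answers and converts it into a per-prompt update weight. This is an assumption-laden toy, not the paper's exact formulation: semantic equivalence is approximated by exact string match (the method proper would cluster answers by meaning), and the linear modulation `1 - alpha * H / H_max` is an illustrative choice of how entropy could down-weight high-uncertainty prompts.

```python
import math
from collections import Counter


def semantic_entropy(answers):
    """Entropy over semantic clusters of sampled answers.

    Exact string match stands in for semantic clustering here;
    the actual method would group answers by meaning.
    """
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def seed_weight(answers, alpha=1.0):
    """Illustrative uncertainty-aware weight for a prompt's policy update.

    High entropy (the model's sampled answers disagree) shrinks the
    weight toward 0, yielding a more conservative update; low entropy
    (consistent answers) keeps the weight near 1. The linear form is
    a hypothetical choice, not necessarily the paper's modulation.
    """
    h = semantic_entropy(answers)
    h_max = math.log(len(answers)) if len(answers) > 1 else 1.0
    return max(0.0, 1.0 - alpha * h / h_max)
```

For example, a group of four identical answers gives entropy 0 and full weight 1.0, while four mutually distinct answers give maximal entropy and weight 0.0, suppressing the update for that prompt.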
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6966