Keywords: Reinforcement Learning for LLMs, LLM Reasoning, Efficient Reasoning, Policy Optimization
TL;DR: GFPO: sample more outputs, filter by length/efficiency, and optimize only on the survivors—curbing chain-of-thought length inflation while matching GRPO-level accuracy.
Abstract: Large language models trained with reinforcement learning on verifiable rewards often inflate response length—trading brevity for accuracy. While longer reasoning can help on hard problems, many extra tokens are filler: verbose text that makes little progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem and training only on responses filtered by (1) length and (2) token efficiency (reward per token). By sampling more at training time, GFPO teaches models to think less at inference time. On Phi-4-reasoning, GFPO cuts GRPO's length inflation by up to 85% across STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while preserving accuracy. We further propose Adaptive Difficulty GFPO, which allocates more training exploration to harder problems using real-time difficulty estimates, yielding better efficiency-accuracy trade-offs on challenging questions. GFPO demonstrates that modest extra training compute can deliver substantial test-time savings—an effective recipe for efficient reasoning.
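The abstract describes GFPO's core mechanism: sample a larger group of responses per problem, filter them by length or token efficiency, and compute policy-gradient advantages only over the survivors. Below is a minimal, hedged sketch of that filtering step in Python; the function name `gfpo_filter`, the group size `k`, and the GRPO-style group normalization over survivors are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def gfpo_filter(responses, rewards, k, metric="token_efficiency"):
    """Sketch of GFPO's group filtering (illustrative, not the paper's code).

    responses: list of sampled token-id sequences for one prompt
    rewards:   per-response scalar rewards (e.g., 0/1 from a verifier)
    k:         number of responses to keep for training
    metric:    "length" keeps the shortest responses;
               "token_efficiency" keeps the highest reward-per-token ones
    """
    lengths = np.array([len(r) for r in responses], dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    if metric == "length":
        scores = -lengths                             # shorter is better
    else:
        scores = rewards / np.maximum(lengths, 1.0)   # reward per token

    keep = np.argsort(scores)[-k:]                    # indices of survivors

    # GRPO-style group-normalized advantages, computed over survivors only;
    # filtered-out responses get zero advantage and contribute no gradient.
    adv = np.zeros(len(responses))
    mu = rewards[keep].mean()
    sigma = rewards[keep].std() + 1e-6
    adv[keep] = (rewards[keep] - mu) / sigma
    return keep, adv
```

Under these assumptions, the surviving responses and their advantages would feed into an otherwise standard GRPO update, so the only change relative to GRPO is the extra sampling plus the filter.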
Submission Number: 182