Keywords: Reinforcement Learning for LLMs, LLM Reasoning, Efficient Reasoning, Policy Optimization
TL;DR: GFPO: sample more outputs, filter by length/efficiency, and optimize only on the survivors—curbing chain-of-thought length inflation while matching GRPO-level accuracy.
Abstract: Large language models trained with reinforcement learning on verifiable rewards often inflate response length, trading brevity for accuracy. While longer reasoning can help on hard problems, many extra tokens are filler: verbose text that makes little progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem and training only on responses filtered by (1) length and (2) token efficiency (reward per token). By sampling more at training time, GFPO teaches models to think less at inference time. On Phi-4-reasoning, GFPO cuts GRPO's length inflation by up to 85\% across STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while preserving accuracy. We further propose Adaptive Difficulty GFPO, which allocates more training exploration to harder problems, yielding better efficiency-accuracy trade-offs on challenging questions. With only a 7\% increase in training time, GFPO reduces end-to-end latency by $\sim$30\%, cutting response time on hard queries by 90 seconds. GFPO trades modest training-time increases for lasting gains in inference: an effective recipe for efficient reasoning.
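For readers who prefer pseudocode, the following is a minimal sketch of the group-filtering step described in the abstract: sample a larger group per prompt, score each response by length or by token efficiency (reward per token), and keep only the top-k survivors for the policy-gradient update. Function and parameter names (`gfpo_filter`, `k`, `metric`) are illustrative and not taken from the paper.

```python
import numpy as np

def gfpo_filter(responses, rewards, k, metric="token_efficiency"):
    """Select k of the sampled responses for one prompt to train on.

    responses: list of token-id sequences sampled for the prompt
    rewards:   per-response scalar rewards (e.g., verifiable 0/1 correctness)
    metric:    "length" keeps the shortest responses;
               "token_efficiency" keeps the highest reward-per-token ones.
    Returns a boolean mask; responses outside the mask would receive zero
    advantage and thus not contribute to the policy update.
    """
    lengths = np.array([len(r) for r in responses], dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    if metric == "length":
        scores = -lengths            # shorter responses score higher
    else:
        scores = rewards / lengths   # reward per token

    keep = np.zeros(len(responses), dtype=bool)
    keep[np.argsort(scores)[-k:]] = True   # retain the top-k scored responses
    return keep
```

Under this sketch, a GRPO-style update would then compute advantages and gradients only over the retained responses; the exact normalization used in the paper may differ.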
Primary Area: reinforcement learning
Submission Number: 16371