Keywords: Large Language Models, Efficient Reasoning
Abstract: As test-time scaling becomes a pivotal research frontier in the development of Large Language Models (LLMs), advanced post-training methodologies increasingly focus on extending the generation length of long Chain-of-Thought (CoT) responses to enhance reasoning capabilities, aiming for DeepSeek-R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning baselines, manifesting as excessive redundancy and repetitive thinking patterns. To address this issue, we propose \textbf{G}roup \textbf{L}ength \textbf{P}olicy \textbf{O}ptimization (\textbf{GLPO}), a simple yet effective approach for achieving concise reasoning in LLMs. Specifically, GLPO's reward function penalizes CoT outputs based on their length relative to the maximum output length, encouraging the model to generate shorter, higher-quality CoT outputs. Meanwhile, GLPO optimizes output length only once all rollouts of a sample are correct, following the "\textit{walk before you run}" principle. Experimental results show that the model trained with GLPO, which generates more concise CoT outputs, outperforms recent state-of-the-art reasoning models under the zero-RL paradigm on the AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks. The model checkpoints will be publicly released.
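The abstract's reward design can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name `glpo_reward`, the linear penalty form, and the coefficient `alpha` are all assumptions; the only grounded elements are (a) a length penalty relative to the maximum output length and (b) applying it only when every rollout in a sample's group is correct.

```python
def glpo_reward(rollouts, max_len, alpha=0.5):
    """Hypothetical sketch of a GLPO-style reward for one sample's group.

    rollouts: list of (output_length, is_correct) pairs, one per rollout.
    max_len:  maximum allowed output length.
    alpha:    assumed penalty coefficient (not specified in the abstract).
    """
    # "Walk before you run": only penalize length once all rollouts are correct.
    all_correct = all(correct for _, correct in rollouts)
    rewards = []
    for length, correct in rollouts:
        r = 1.0 if correct else 0.0  # base correctness reward
        if all_correct:
            # Penalty grows with length relative to the max output length,
            # steering the group toward shorter correct CoT outputs.
            r -= alpha * (length / max_len)
        rewards.append(r)
    return rewards
```

Under this sketch, a fully correct group with lengths 500 and 250 (max 1000) receives rewards 0.75 and 0.875, so shorter correct rollouts gain a relative advantage; a group with any incorrect rollout is scored on correctness alone.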
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches for low compute settings - efficiency
Languages Studied: English
Submission Number: 5890