Keywords: Large Language Models, Efficient Reasoning
Abstract: As test-time scaling becomes a pivotal research frontier in the development of Large Language Models (LLMs), advanced post-training methodologies increasingly focus on extending the generation length of long Chain-of-Thought (CoT) responses to enhance reasoning capabilities, aiming for DeepSeek-R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning baselines, manifesting as excessive redundancy and repetitive thinking patterns. To address this issue, we propose \textbf{G}roup \textbf{L}ength \textbf{P}olicy \textbf{O}ptimization (\textbf{GLPO}), a simple yet effective approach for achieving concise reasoning in LLMs. Specifically, GLPO's reward function penalizes CoT outputs based on their length relative to the maximum output length, encouraging the model to generate shorter, higher-quality CoT outputs. Meanwhile, GLPO optimizes output length only once all rollouts of a sample are correct, following the "\textit{walk before you run}" principle. Experimental results show that the model trained with GLPO, which generates more concise CoT outputs, outperforms recent state-of-the-art reasoning models under the zero-RL paradigm on the AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks. The model checkpoints will be publicly released.
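The abstract's reward design can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name `glpo_reward`, the linear penalty form, and the coefficient `alpha` are all assumptions; the only grounded elements are (a) a length penalty relative to the maximum output length and (b) applying it only when every rollout in a sample's group is correct.

```python
def glpo_reward(rollouts, max_len, alpha=0.5):
    """Hypothetical sketch of a GLPO-style reward for one sample's group.

    rollouts: list of (output_length, is_correct) pairs, one per rollout.
    max_len:  maximum allowed output length.
    alpha:    assumed penalty coefficient (not specified in the abstract).
    """
    # "Walk before you run": only penalize length once all rollouts are correct.
    all_correct = all(correct for _, correct in rollouts)
    rewards = []
    for length, correct in rollouts:
        r = 1.0 if correct else 0.0  # base correctness reward
        if all_correct:
            # Penalty grows with length relative to the max output length,
            # steering the group toward shorter correct CoT outputs.
            r -= alpha * (length / max_len)
        rewards.append(r)
    return rewards
```

Under this sketch, a fully correct group with lengths 500 and 250 (max 1000) receives rewards 0.75 and 0.875, so shorter correct rollouts gain a relative advantage; a group with any incorrect rollout is scored on correctness alone.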
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches for low compute settings - efficiency
Languages Studied: English
Submission Number: 5890