Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning

ACL ARR 2025 July Submission 268 Authors

26 Jul 2025 (modified: 28 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: As test-time scaling becomes a pivotal research frontier in Large Language Model (LLM) development, contemporary post-training methodologies increasingly focus on extending the generation length of long Chain-of-Thought (CoT) responses to push reasoning capabilities toward DeepSeek-R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning models, manifesting as excessive redundancy or repetitive thinking patterns in long CoT responses. To address this issue, we propose a simple yet effective two-stage reinforcement learning framework for concise reasoning in LLMs, named \textbf{ConciseR}. The first stage incentivizes the model's reasoning capabilities via Group Relative Policy Optimization with \textit{clip-higher} and \textit{dynamic sampling} (GRPO++), and the second stage explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization (L-GRPO). \textbf{Significantly, L-GRPO only optimizes response length once all rollouts of a sample are correct, following the "walk before you run" principle.} Extensive experiments show that ConciseR is compatible with reasoning models incentivized by reinforcement learning, reducing response length while improving accuracy on the AIME 2024, AMC 2023, MATH-500, Minerva, and Olympiad benchmarks.
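The "walk before you run" gating described in the abstract lends itself to a small illustration. The sketch below is not the authors' implementation: the reward shape, the length bonus, and all names (Rollout, l_grpo_rewards, alpha, max_len) are hypothetical assumptions, and the paper's exact L-GRPO objective may differ. It only shows the idea that a length-aware reward is applied to a group of rollouts when, and only when, every rollout in that group is correct, on top of a GRPO-style group-relative advantage.

```python
# Minimal, illustrative sketch of gated length rewards with group-relative advantages.
# Names and the specific reward shaping are hypothetical, not taken from the paper.
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    correct: bool  # outcome signal, e.g. exact match on the final answer
    length: int    # number of generated tokens


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


def l_grpo_rewards(group: List[Rollout], max_len: int, alpha: float = 0.1) -> List[float]:
    """Stage-2 reward: correctness plus a length bonus, gated on full-group correctness."""
    base = [1.0 if r.correct else 0.0 for r in group]
    if all(r.correct for r in group):
        # Only once all rollouts are correct does a shorter response earn extra reward.
        return [b + alpha * (1.0 - r.length / max_len) for b, r in zip(base, group)]
    return base


if __name__ == "__main__":
    group = [Rollout(True, 900), Rollout(True, 400), Rollout(True, 1500)]
    advantages = group_relative_advantages(l_grpo_rewards(group, max_len=2048))
    print(advantages)  # shorter correct rollouts receive higher relative advantage
```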
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Large Language Models, Large Reasoning Models, Efficient Reasoning
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 268