Stabilizing Efficient Reasoning with Step-Level Advantage Selection

ACL ARR 2026 January Submission 8992 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: LLM Reasoning, Efficient Reasoning, Reinforcement Learning
Abstract: Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning seeks to reduce this overhead through length-based rewards or pruning mechanisms, many approaches rely on post-training with substantially shorter context windows than those used during base model training, a factor whose effect has not been systematically isolated. In this work, we first show that short-context post-training alone, using standard GRPO without any length-aware objectives, already induces substantial reasoning compression. However, this process leads to increasingly unstable training dynamics, characterized by accuracy degradation as training progresses. To address this issue, we propose Step-level Advantage Selection (SAS), which operates at the reasoning step level: it selectively filters noisy or redundant reasoning steps in correct rollouts, while preserving high-confidence intermediate reasoning steps even from verifier-failed rollouts, where failures may arise from truncation or verification issues rather than incorrect reasoning. Evaluated across diverse mathematical and general reasoning benchmarks, our approach reduces average reasoning length by more than 30% in tokens while consistently outperforming other baselines, with an average Pass@1 accuracy gain of 3.79 points over a length-aware baseline. SAS achieves a consistently better accuracy–efficiency trade-off than strong length-aware baselines, demonstrating that stable and effective reasoning compression can be achieved with minimal modifications to standard reinforcement learning pipelines. These results highlight the importance of disentangling context-length effects from explicit length-control objectives in efficient reasoning.
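The abstract's core selection rule (keep low-noise steps from correct rollouts, keep only high-confidence steps from verifier-failed rollouts) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `Step` fields, the `confidence`/`advantage` scores, and both thresholds are assumptions introduced here for clarity.

```python
# Hypothetical sketch of step-level advantage selection (SAS) as described
# in the abstract. All names, fields, and thresholds are illustrative
# assumptions, not the paper's actual method details.

from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    confidence: float  # assumed per-step confidence score in [0, 1]
    advantage: float   # assumed GRPO-style step-level advantage


def select_steps(steps: List[Step],
                 rollout_correct: bool,
                 conf_threshold: float = 0.8,
                 adv_threshold: float = 0.0) -> List[Step]:
    """Select which reasoning steps contribute to the policy update.

    - Correct rollouts: filter out noisy/redundant steps (low advantage).
    - Verifier-failed rollouts: keep only high-confidence steps, since the
      failure may stem from truncation or verification issues rather than
      incorrect reasoning.
    """
    if rollout_correct:
        return [s for s in steps if s.advantage > adv_threshold]
    return [s for s in steps if s.confidence >= conf_threshold]
```

Under this sketch, a low-advantage step in a correct rollout is dropped, while a high-confidence step survives even when the rollout as a whole failed verification.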
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8992