DLER: Doing Length pEnalty Right — Incentivizing More Intelligence per Token via Reinforcement Learning

ICLR 2026 Conference Submission 8094 Authors

16 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reasoning Model, Reasoning Efficiency, CoT Efficiency, Test-time scaling
TL;DR: Doing Length Penalty Right introduces a reinforcement learning approach that incentivizes models to deliver more intelligence per token.
Abstract: Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token, i.e., accuracy relative to response length, remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that the accuracy degradation it causes arises not from the lack of a sophisticated penalty but from inadequate RL optimization. We identify three key challenges: large bias in advantage estimation, entropy collapse, and sparse reward signals. We address them with $\textbf{D}\text{oing } \textbf{L}\text{ength } \text{p}\textbf{E}\text{nalty } \textbf{R}\text{ight}$ ($\textbf{DLER}$), a training recipe that combines batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy–efficiency trade-offs, cutting output length by over 70\% while surpassing the accuracy of all previous baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28\% higher accuracy and lower latency. We further propose Difficulty-Aware DLER, which adaptively tightens the truncation budget on easier questions for additional efficiency gains, and an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful when RL training data is scarce.
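To make two of the recipe's ingredients concrete, here is a minimal sketch (not the authors' code) of a binary truncation-penalty reward and batch-wise, rather than per-prompt-group, reward normalization for advantage estimation. The function names, the 0/1 reward values, and the exact normalization are illustrative assumptions; the paper's precise reward shaping may differ.

```python
import numpy as np

def truncation_reward(correct: bool, length: int, max_len: int) -> float:
    """Simplest length penalty: a response that exceeds the length
    budget is truncated and earns zero reward regardless of correctness.
    (Assumed binary reward scheme, for illustration only.)"""
    if length > max_len:
        return 0.0
    return 1.0 if correct else 0.0

def batch_normalized_advantages(rewards: np.ndarray) -> np.ndarray:
    """Batch-wise reward normalization: standardize rewards over the
    entire batch instead of within each prompt group, the change the
    abstract credits with reducing bias in advantage estimation."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: four sampled responses under a 1024-token budget.
rewards = np.array([
    truncation_reward(True, 512, 1024),   # correct, within budget -> 1.0
    truncation_reward(True, 2048, 1024),  # correct but truncated  -> 0.0
    truncation_reward(False, 300, 1024),  # concise but wrong      -> 0.0
    truncation_reward(True, 900, 1024),   # correct and concise    -> 1.0
])
print(batch_normalized_advantages(rewards))  # [ 1. -1. -1.  1.]
```

Note how truncation gives long-but-correct responses the same advantage as wrong ones, so the policy is pushed toward answers that are both correct and within budget.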
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8094