Learning to Reason Efficiently with Discounted Reinforcement Learning

Published: 26 Jan 2026, Last Modified: 11 Feb 2026
Venue: ICLR 2026 Poster
License: CC BY 4.0
Keywords: reinforcement learning, reasoning, Blackwell optimality, post-training
Abstract: Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing the reasoning tokens using a discounted reinforcement-learning setup (interpretable as a small per-token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning; in practice we discount only the environment (correctness) reward. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
Primary Area: reinforcement learning
Submission Number: 5154
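
To make the "small per-token cost" reading in the abstract concrete, here is a minimal sketch. It assumes a single terminal correctness reward of 1 or 0 that is discounted by the response length; the function names, the reward values, and the discount factor 0.999 are illustrative assumptions, not details taken from the paper.

```python
def discounted_return(correct: bool, num_tokens: int, gamma: float = 0.999) -> float:
    """Terminal correctness reward discounted by response length.

    Assumed form: gamma ** num_tokens * reward, with reward in {0, 1}.
    """
    reward = 1.0 if correct else 0.0
    return (gamma ** num_tokens) * reward


def per_token_cost_approx(correct: bool, num_tokens: int, gamma: float = 0.999) -> float:
    """First-order approximation of the discounted return.

    For gamma close to 1, gamma ** T is roughly 1 - (1 - gamma) * T, so each
    token of a correct response costs about (1 - gamma) of the reward.
    """
    reward = 1.0 if correct else 0.0
    return reward * (1.0 - (1.0 - gamma) * num_tokens)


if __name__ == "__main__":
    # Compare a short and a long response under the same discount factor.
    for length in (100, 1000):
        exact = discounted_return(True, length)
        approx = per_token_cost_approx(True, length)
        print(f"tokens={length} exact={exact:.4f} linear_approx={approx:.4f}")
```

Under these assumptions a correct short response earns more than a correct long one, and an incorrect response earns zero regardless of length, so the discount favors brevity without rewarding wrong answers. The linear approximation is only close while (1 - gamma) times the token count stays small.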