Adaptive Inference‑Time Scaling for LRMs using Uncertainty‑Aware RL

ICLR 2026 Conference Submission 25317 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: uncertainty-guided self-braking tuning (USBT), adaptive inference, large reasoning models (LRMs), reasoning depth control, uncertainty-aware reinforcement learning, semantic entropy (confidence), chain-of-thought (CoT), early exit, S‑GRPO, GRPO, reward shaping, length penalties, branch‑parallel decoding, token reduction, latency reduction, compute efficiency, inference-time scaling, self‑regulation
TL;DR: USBT learns RL policies that throttle LRM reasoning depth using uncertainty (semantic entropy) plus length penalties, yielding concise CoT. S‑GRPO adds early-exit control with parallel search, cutting tokens and latency while maintaining accuracy.
Abstract: The widespread adoption of Large Reasoning Models (LRMs), such as Gemini 2.5 Pro Deep Think, OpenAI GPT-5 Pro, and SuperGrok 4 Heavy, is bottlenecked by their computational inefficiency, which stems primarily from the "overthinking phenomenon": the propensity to generate unnecessarily long Chain-of-Thought (CoT) sequences even for simple queries. This verbose output, while enhancing accuracy, substantially increases inference cost and latency. Current mitigation efforts rely on L1 methods such as explicit token-budget instructions or post-hoc truncation, which either lack precise control or struggle to generalize across tasks of varying complexity. We propose Uncertainty-Guided Self-Braking Tuning (USBT), an L2 adaptive inference framework that addresses overthinking by enabling LRMs to autonomously regulate their reasoning depth based on real-time internal uncertainty. We frame adaptive inference as a sequential decision-making process optimized via Reinforcement Learning (RL), building on core algorithms such as Group Relative Policy Optimization (GRPO). Our novel contribution is integrating a confidence metric, such as certainindex based on semantic entropy, into the RL reward function alongside explicit length penalties. This reward incentivizes the model to produce concise, correct reasoning paths and facilitates an early-exit strategy. Techniques such as Serial-Group Decaying-Reward Policy Optimization (S-GRPO), which serialize early-exit interventions and decay rewards for later completions, demonstrate that this paradigm achieves substantial token reduction (35.4%–61.1%) while boosting accuracy. Our USBT framework generalizes this approach by actively coupling the decay and penalty coefficients with the measured uncertainty, allowing the model to recognize and inhibit excessive reasoning and cultivating an intrinsic ability to self-regulate without external control. Furthermore, integrating this uncertainty-based self-regulation with inference acceleration strategies, such as branch-parallel decoding, significantly reduces end-to-end latency. Experiments incorporating our self-braking mechanism consistently show dramatic reductions in token consumption (up to 60%) across complex benchmarks while maintaining high performance.
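The sketch below illustrates, in simplified form, the kind of reward shaping the abstract describes: a correctness term discounted by exit position, a confidence bonus derived from semantic entropy, a length penalty against a token budget, and a decay coefficient coupled to the measured uncertainty, followed by GRPO-style group-relative normalization. All names (`Rollout`, `shaped_reward`, `group_relative_advantages`) and the specific coefficients are illustrative assumptions, not the paper's actual implementation.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    correct: bool            # final answer matches the reference
    num_tokens: int          # length of the generated CoT
    exit_index: int          # serial position of the early-exit point (0 = earliest)
    semantic_entropy: float  # entropy over semantically clustered answers (nats)

def shaped_reward(r: Rollout,
                  length_budget: int = 2048,
                  length_coef: float = 0.2,
                  base_decay: float = 0.8,
                  entropy_scale: float = 1.0) -> float:
    """Combine correctness, a confidence bonus from semantic entropy,
    a length penalty, and an exit-position decay into one scalar reward."""
    # Low semantic entropy -> confidence close to 1.
    confidence = math.exp(-entropy_scale * r.semantic_entropy)

    # Couple the decay coefficient to the measured uncertainty: when the model
    # is already confident, later exit positions are discounted more sharply.
    decay = base_decay * (1.0 - confidence) + 1e-3
    exit_discount = decay ** r.exit_index

    accuracy_term = (1.0 if r.correct else 0.0) * exit_discount
    length_penalty = length_coef * max(0.0, r.num_tokens / length_budget - 1.0)
    return accuracy_term + 0.5 * confidence - length_penalty

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: normalize rewards within a sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5
    return [(x - mean) / (std + 1e-6) for x in rewards]

# Example group: a confident, correct, short rollout scores higher than a
# correct but verbose, late-exiting one, which in turn beats an incorrect one.
group = [
    Rollout(correct=True,  num_tokens=800,  exit_index=0, semantic_entropy=0.1),
    Rollout(correct=True,  num_tokens=4000, exit_index=3, semantic_entropy=0.1),
    Rollout(correct=False, num_tokens=1500, exit_index=1, semantic_entropy=1.2),
]
print(group_relative_advantages([shaped_reward(r) for r in group]))
```

Under these assumed coefficients, high confidence both adds a direct bonus and shrinks the decay coefficient, so late-exit completions receive little credit once uncertainty is low; this is one plausible way to realize the uncertainty-coupled decay described in the abstract.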
Primary Area: reinforcement learning
Submission Number: 25317