Keywords: Large Language Models, Adaptive Reasoning, Reinforcement Learning, Intrinsic Uncertainty, Efficient Inference
Abstract: Large Reasoning Models (LRMs) excel at complex tasks using Chain-of-Thought prompting but suffer from 'overthinking', often allocating costly reasoning resources to simple queries. Existing adaptive methods typically rely on opaque reinforcement learning strategies based on reasoning-length penalties, which lack both interpretability and intrinsic grounding. We argue that true efficiency stems from a model's internalized self-awareness of task complexity. In this work, we introduce \textbf{Dubito-Pro}, a framework in which the LLM autonomously selects between 'Fast' and 'Slow' thinking modes based on the input context, without external classifiers or inference-time intervention. Our core insight is that Entropy Variance serves as a high-fidelity supervision signal for cognitive struggle during training. To instill this capability, we propose Intrinsic-Weighted Group Relative Policy Optimization (I-GRPO). Unlike standard RL approaches that reward only outcome correctness, I-GRPO introduces a Cognitive Alignment Reward, computed post hoc during training. This mechanism penalizes the model for selecting the 'Fast' path on high-variance (ambiguous) queries, effectively teaching it to anticipate its own uncertainty. Extensive experiments on a mixed-difficulty benchmark demonstrate that Dubito-Pro acquires a robust, intuitive switching policy: it reduces token costs by 80\% on simple tasks, improves overall accuracy by 7.47\%, and establishes a new Pareto frontier between efficiency and accuracy.
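The abstract's core signal, Entropy Variance, can be sketched as the variance of per-step token entropies over a generated sequence. The snippet below is a minimal illustration of that idea only; the function names and the exact aggregation are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch (assumed formulation, not the paper's code):
# Entropy Variance = variance of per-step Shannon entropies of the
# model's next-token distributions over one generated sequence.
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_variance(step_distributions):
    """Population variance of per-step entropies across a sequence."""
    ents = [token_entropy(p) for p in step_distributions]
    mean = sum(ents) / len(ents)
    return sum((e - mean) ** 2 for e in ents) / len(ents)

# A sequence whose steps are all equally (un)certain has zero variance;
# mixing confident and uncertain steps yields a positive variance,
# which the abstract treats as a marker of "cognitive struggle".
flat = [[0.25] * 4] * 3
mixed = [[0.25] * 4, [0.97, 0.01, 0.01, 0.01]]
print(entropy_variance(flat))   # 0.0
print(entropy_variance(mixed))  # > 0
```

Under this reading, the Cognitive Alignment Reward would penalize a 'Fast'-mode choice whenever `entropy_variance` for the query's sampled reasoning traces is high.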
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency, NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 4217