Keywords: Language models, Natural language processing
Abstract: Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (think-on) and when to respond directly (think-off). We construct a cross-domain, logically rich dataset using a hybrid multi-agent pipeline that provides explicit supervision for reasoning-mode selection. Building on this data, we introduce a hybrid reinforcement learning (RL) reward system that integrates mode-specific rewards with global bonuses to align reasoning quality with efficiency. Experiments on mathematics, coding, and general-knowledge benchmarks demonstrate that HiPO substantially reduces token length while maintaining or improving accuracy. Further analysis shows that HiPO learns fine-grained, context-sensitive reasoning behavior, activating CoT primarily on reasoning-intensive tasks and suppressing it when unnecessary.
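To make the reward description concrete, below is a minimal sketch of one way a mode-specific reward could be combined with a global efficiency bonus. The function name, the additive form, the token budget, and the weight are illustrative assumptions for exposition only, not the paper's actual formulation.

```python
# Hypothetical sketch: combine a mode-specific correctness reward with a
# global bonus that favors shorter responses when the answer is correct.
# All names, weights, and the additive form are assumptions, not HiPO's
# published reward definition.

def hybrid_reward(correct: bool, think_on: bool, num_tokens: int,
                  token_budget: int = 1024,
                  efficiency_weight: float = 0.2) -> float:
    # Mode-specific reward: correctness is rewarded in either mode
    # (think-on or think-off).
    mode_reward = 1.0 if correct else 0.0

    # Global bonus: reward unused token budget, so the policy prefers
    # think-off (or shorter reasoning traces) when it can still answer
    # correctly.
    saved_fraction = max(0.0, (token_budget - num_tokens) / token_budget)
    global_bonus = efficiency_weight * saved_fraction if correct else 0.0

    return mode_reward + global_bonus


# Example: a correct think-off answer using few tokens scores higher than
# a correct think-on answer that spends the entire budget.
print(hybrid_reward(correct=True, think_on=False, num_tokens=64))    # ~1.19
print(hybrid_reward(correct=True, think_on=True, num_tokens=1024))   # 1.0
```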
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16231