Keywords: Language models, Natural language processing
Abstract: Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (think-on) and when to respond directly (think-off). We construct a cross-domain, logically rich dataset using a hybrid multi-agent pipeline that provides explicit supervision for reasoning-mode selection. Building on this data, we introduce a hybrid reinforcement learning (RL) reward system that integrates mode-specific rewards with global bonuses to align reasoning quality with efficiency. Experiments on mathematics, coding, and general-knowledge benchmarks demonstrate that HiPO substantially reduces token length while maintaining or improving accuracy. Further analysis shows that HiPO learns fine-grained, context-sensitive reasoning behavior, activating CoT primarily on reasoning-intensive tasks and suppressing it when unnecessary.
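To make the reward description concrete, below is a minimal sketch of one way a mode-specific reward could be combined with a global efficiency bonus. The function name, the additive form, the token budget, and the weight are illustrative assumptions for exposition only, not the paper's actual formulation.

```python
# Hypothetical sketch: combine a mode-specific correctness reward with a
# global bonus that favors shorter responses when the answer is correct.
# All names, weights, and the additive form are assumptions, not HiPO's
# published reward definition.

def hybrid_reward(correct: bool, think_on: bool, num_tokens: int,
                  token_budget: int = 1024,
                  efficiency_weight: float = 0.2) -> float:
    # Mode-specific reward: correctness is rewarded in either mode
    # (think-on or think-off).
    mode_reward = 1.0 if correct else 0.0

    # Global bonus: reward unused token budget, so the policy prefers
    # think-off (or shorter reasoning traces) when it can still answer
    # correctly.
    saved_fraction = max(0.0, (token_budget - num_tokens) / token_budget)
    global_bonus = efficiency_weight * saved_fraction if correct else 0.0

    return mode_reward + global_bonus


# Example: a correct think-off answer using few tokens scores higher than
# a correct think-on answer that spends the entire budget.
print(hybrid_reward(correct=True, think_on=False, num_tokens=64))    # ~1.19
print(hybrid_reward(correct=True, think_on=True, num_tokens=1024))   # 1.0
```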
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16231