HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

ACL ARR 2026 January Submission7514 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: [Reinforcement Learning, Overthinking, LLM Efficiency]
Abstract: Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to decide selectively when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, which provides paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments on mathematics and coding benchmarks demonstrate that HiPO substantially reduces token length while maintaining or improving accuracy. We hope HiPO can serve as a principled approach to efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
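The abstract describes a reward that trades off accuracy against token cost so the policy is not rewarded for always choosing lengthy Think-on traces. The following is a minimal sketch of a reward in that spirit; the function name, the linear accuracy-minus-length form, and the `length_budget` and `length_weight` parameters are illustrative assumptions, not the paper's actual definition.

```python
def hybrid_reward(correct: bool, num_tokens: int,
                  length_budget: int = 512, length_weight: float = 0.3) -> float:
    """Toy accuracy-vs-efficiency reward (illustrative, not HiPO's).

    An accuracy term minus a normalized length penalty: a correct short
    Think-off answer earns more than an equally correct long Think-on trace,
    discouraging over-reliance on detailed reasoning.
    """
    accuracy = 1.0 if correct else 0.0
    # Cap the normalized token cost at 1 so very long traces are not
    # penalized without bound.
    length_cost = min(num_tokens / length_budget, 1.0)
    return accuracy - length_weight * length_cost

# A correct 40-token direct answer outscores a correct 480-token reasoned one,
# while an incorrect answer scores below both regardless of length.
short_direct = hybrid_reward(correct=True, num_tokens=40)
long_reasoned = hybrid_reward(correct=True, num_tokens=480)
wrong_answer = hybrid_reward(correct=False, num_tokens=40)
```

Any practical variant would also need safeguards (e.g., not penalizing length on problems where reasoning is required for correctness), which is exactly the balance the hybrid reward system is described as addressing.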
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: [LLM Efficiency]
Contribution Types: Approaches to low-compute settings (efficiency)
Languages Studied: [programming languages, natural languages]
Submission Number: 7514