Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents

Published: 26 Jan 2026, Last Modified: 12 May 2026
ICLR 2026 Poster · CC BY 4.0
Keywords: Turn-Level Reward, Search Agent, Agentic RL
TL;DR: We propose IGPO, a simple and effective reinforcement learning framework with turn-level reward for training multi-turn search agents.
Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and information acquisition. However, existing approaches typically rely on outcome-based rewards that are provided only after generating the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate three critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals; (ii) lack of fine-grained credit assignment, where the correctness of intermediate turns is obscured, especially in long-horizon tasks; and (iii) poor sample efficiency, where each rollout yields only a single outcome signal, leading to low data utilization. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal ground-truth confidence gain across consecutive turns. Unlike prior process-level reward approaches that rely on external reward models, manual process annotations, or costly Monte Carlo estimation, IGPO obtains dense intrinsic rewards directly from the model’s own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward signals. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved data efficiency. Our code is available at https://github.com/GuoqingWang1/IGPO.
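To make the reward definition concrete, below is a minimal Python sketch of how such turn-level rewards could be computed, assuming each turn's "confidence" is the policy's own probability of the ground-truth answer given the trajectory so far. The function name igpo_rewards, the outcome_weight parameter, and the additive combination of turn-level and outcome rewards are illustrative assumptions, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
from typing import List

def igpo_rewards(
    gt_confidences: List[float],
    outcome_reward: float,
    outcome_weight: float = 1.0,
) -> List[float]:
    """Turn-level rewards as marginal ground-truth confidence gains (sketch).

    gt_confidences[0] is the policy's confidence in the ground-truth answer
    given only the question; gt_confidences[t] is its confidence after
    observing the t-th turn's search results. The reward for turn t is the
    confidence gain over the previous turn; here the final turn additionally
    receives the weighted outcome reward (an assumed way of combining the
    two signals).
    """
    turn_rewards = [
        gt_confidences[t] - gt_confidences[t - 1]
        for t in range(1, len(gt_confidences))
    ]
    turn_rewards[-1] += outcome_weight * outcome_reward  # dense + outcome signal
    return turn_rewards

# Example: two search turns that each raise confidence in the gold answer,
# followed by a correct final answer (outcome reward 1.0).
print(igpo_rewards([0.10, 0.35, 0.80], outcome_reward=1.0))
# -> approximately [0.25, 1.45] (up to float rounding)
```

Under this reading, an uninformative turn (no confidence gain) earns a near-zero reward while a misleading one can earn a negative reward, which is what gives the scheme denser credit assignment than a single outcome reward.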
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4748