PilotRAG: Teaching LLMs Multi-Turn Hybrid RAG via Reinforcement Learning

17 Sept 2025 (modified: 30 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Retrieval-Augmented Generation, Reinforcement Learning
TL;DR: We propose PilotRAG, an RL-based RAG framework that enables LLMs to perform multi-turn hybrid RAG with both high accuracy and efficiency.
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, typically from unstructured texts or structured graphs. While recent progress has extended text-based RAG to multi-turn reasoning through Reinforcement Learning (RL), existing graph-based and hybrid RAG methods generally rely on fixed or handcrafted multi-turn retrieval procedures rather than an RL-trained policy, and thus do not support adaptive, decision-based multi-turn reasoning. This restricts their ability to incrementally integrate supplementary evidence as reasoning unfolds, reducing their effectiveness on complex multi-hop questions. To address this limitation, we introduce PilotRAG, an RL-based framework that enables LLMs to perform multi-turn, adaptive graph-text hybrid RAG by dynamically interleaving reasoning, hybrid retrieval, and answer formulation. PilotRAG jointly optimizes the entire generation process via RL, allowing the model to learn when to reason, what to retrieve from either unstructured texts or structured graphs, and when to produce final answers, all within a unified generation policy. To guide this learning process, we design a two-stage training framework with a reward function that accounts for both task outcome and retrieval efficiency. By rewarding answer accuracy and efficient retrieval while penalizing redundant retrieval operations, the model learns to retrieve selectively and reason effectively. Experiments on both simple and multi-hop question answering benchmarks demonstrate that PilotRAG significantly outperforms existing RAG baselines, highlighting the benefits of end-to-end RL for enabling adaptive and iterative retrieval in complex reasoning scenarios.
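The abstract describes a reward that combines task outcome with retrieval efficiency. The paper's exact formulation is not given here, but a minimal sketch of one plausible form, assuming a binary accuracy term plus a linear penalty on retrieval calls beyond a small budget (the function name and all parameter values below are hypothetical, not from the paper):

```python
def outcome_efficiency_reward(answer_correct: bool,
                              num_retrievals: int,
                              free_retrievals: int = 3,
                              penalty: float = 0.1) -> float:
    """Illustrative outcome-plus-efficiency reward (hypothetical form).

    Combines a binary answer-accuracy term with a penalty on
    retrieval calls beyond a small budget, so an RL-trained policy
    is pushed to retrieve selectively rather than redundantly.
    """
    outcome = 1.0 if answer_correct else 0.0
    excess = max(0, num_retrievals - free_retrievals)
    return outcome - penalty * excess
```

Under this sketch, a correct answer reached with two retrievals earns the full reward, while a correct answer that needed five retrievals is docked for the two excess calls, which is one way to realize the "rewarding accuracy while penalizing redundant retrieval" trade-off the abstract describes.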
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8781