Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Process-Supervision, Agentic RAG
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge, yet traditional RAG systems struggle with static workflows and limited adaptability for complex, multistep reasoning tasks. Agentic RAG systems, such as DeepResearch, address these issues through dynamic retrieval, iterative context refinement, and adaptive workflows. However, recent methods like Search-R1, which rely on outcome-based reinforcement learning, face challenges such as low exploration efficiency, gradient conflict, and sparse reward signals. To tackle these limitations, we introduce ReasonRAG, a novel method that leverages RAG-ProGUIDE—a high-quality dataset providing fine-grained, process-level rewards for query generation, evidence extraction, and answer generation. By employing process-supervised reinforcement learning, ReasonRAG enhances LLMs’ autonomous capabilities in search, query generation, evidence extraction, and answer synthesis. Experimental results show that ReasonRAG, utilizing RAG-ProGUIDE, outperforms existing approaches like Search-R1 and traditional RAG systems, achieving superior performance on five benchmark datasets with only 5k training instances—significantly fewer than the 90k required by Search-R1. Our code is available at https://github.com/Applied-Machine-Learning-Lab/ReasonRAG.
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 20812
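To make the distinction drawn in the abstract concrete, here is a minimal sketch (not the authors' implementation) contrasting outcome-based reward assignment, as used by methods like Search-R1, with process-level reward assignment over an agentic RAG trajectory. The step names, data structures, and scoring inputs are hypothetical illustrations; RAG-ProGUIDE's actual annotation scheme and ReasonRAG's training loop are described in the paper and repository.

```python
# Hypothetical sketch: outcome vs. process reward assignment for an agentic RAG rollout.
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str   # e.g. "query_generation", "evidence_extraction", "answer_generation"
    output: str   # text produced at this step


def outcome_reward(trajectory: List[Step], final_answer_correct: bool) -> List[float]:
    """Outcome supervision: a single sparse reward on the final step, zeros elsewhere."""
    rewards = [0.0] * len(trajectory)
    rewards[-1] = 1.0 if final_answer_correct else 0.0
    return rewards


def process_reward(trajectory: List[Step], step_scores: List[float]) -> List[float]:
    """Process supervision: each intermediate action gets its own graded reward
    (e.g. query usefulness, evidence relevance), giving a denser learning signal."""
    assert len(step_scores) == len(trajectory)
    return list(step_scores)


if __name__ == "__main__":
    traj = [
        Step("query_generation", "who founded the company that makes product X?"),
        Step("evidence_extraction", "...retrieved passage snippet..."),
        Step("answer_generation", "Jane Doe"),
    ]
    print(outcome_reward(traj, final_answer_correct=True))  # [0.0, 0.0, 1.0]
    print(process_reward(traj, [0.8, 0.6, 1.0]))            # [0.8, 0.6, 1.0]
```

The sketch only illustrates where the reward signal attaches: outcome supervision credits the whole trajectory through the final answer alone, while process supervision scores query generation and evidence extraction directly, which is the property the abstract credits for better exploration efficiency and sample efficiency.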