Keywords: LLM, RAG, Efficiency
Abstract: Retrieval-Augmented Generation (RAG) is pivotal for modern Large Language Models. However, its practical deployment is often hindered by prohibitive inference costs, encompassing both latency and financial overhead from retrieval calls. Current reinforcement learning frameworks focus on improving search capability by solely maximizing answer accuracy, which inadvertently encourages excessive and costly search behavior. This overlooks the fundamental trade-off between task performance and computational efficiency.
To address this, we introduce \Ours, a systematic reinforcement learning framework that teaches models to balance answer quality with search cost. We find that naively penalizing search actions leads to unstable training and suboptimal policies. \Ours therefore employs a \emph{two-stage curriculum} that first builds robust search capabilities before introducing a cost-augmented reward function to cultivate efficiency.
This learning process is underpinned by a stabilized policy optimization algorithm, ensuring the model can robustly learn a judicious policy for when to search. Experiments across diverse question-answering benchmarks show that \Ours reduces retrieval calls by up to 76.5\% while maintaining performance competitive with state-of-the-art models. By enabling a controllable balance between effectiveness and efficiency, \Ours provides a practical path toward building powerful yet economical RAG systems.
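As a minimal illustration of the cost-augmented objective described above (the paper's exact formulation may differ; the symbols $r$, $\mathrm{acc}$, $n_{\mathrm{search}}$, and $\lambda$ are illustrative, not the authors' notation), one natural form subtracts a per-retrieval penalty from the answer-accuracy reward:
\[
r(\tau) \;=\; \mathrm{acc}\bigl(\hat{y}(\tau),\, y^{*}\bigr) \;-\; \lambda \, n_{\mathrm{search}}(\tau),
\]
where $\tau$ is a generated trajectory, $\mathrm{acc}(\hat{y}, y^{*})$ scores the final answer against the reference, $n_{\mathrm{search}}(\tau)$ counts retrieval calls, and $\lambda \ge 0$ controls the effectiveness--efficiency trade-off. Under this reading, the two-stage curriculum would correspond to first training with $\lambda = 0$ to build search capability, then switching to $\lambda > 0$ to cultivate efficiency.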
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6755