Keywords: Multi-Hop RAG, Efficiency, Reasoning, SLMs
Abstract: Reinforcement learning (RL) based on the final answer's reward has driven recent progress in small language models (SLMs) on reasoning-heavy tasks such as math and code. However, applying the same techniques to retrieval-augmented generation (RAG) benchmarks like multi-hop QA has yielded limited gains, often trailing supervised or prompting-only baselines. Instead, we argue that a viable path for RL in multi-hop QA is to use test-time scaling judiciously, optimizing both final answer accuracy and the efficiency of reaching that answer.
We propose FrugalRAG, a two-stage finetuning framework that adaptively _reduces_ the number of retrieval steps based on a question's difficulty. First, we train an SLM with supervised finetuning on a full-exploration policy that generates broad sub-queries. Then, we apply RL to adaptively prune search depth based on question difficulty, directly rewarding policies that balance correctness with frugality. Unlike prior approaches that require roughly 10× more data, our method achieves competitive performance with only ~1,000 training examples. On HotPotQA and other multi-hop QA benchmarks, FrugalRAG attains state-of-the-art efficiency–accuracy tradeoffs, cutting retrieval cost nearly in half. Moreover, on the challenging BrowseCompPlus benchmark, it surpasses SLM-based and other baselines after training on only 200 examples. These results demonstrate that RL, used not to increase reasoning steps but to reduce them, is an effective solution for scalable, efficient RAG.
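As a minimal illustration of the kind of frugality-aware reward the abstract describes, the sketch below combines an answer-correctness bonus with a penalty on retrieval steps. The function name, weight `lambda_cost`, and step budget are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of a reward trading off correctness against retrieval cost.
# The weight `lambda_cost` and the step budget `max_steps` are hypothetical choices,
# not values taken from the paper.

def frugal_reward(answer_correct: bool, num_retrieval_steps: int,
                  max_steps: int = 8, lambda_cost: float = 0.5) -> float:
    """Correctness bonus minus a penalty proportional to the fraction
    of the retrieval budget actually used."""
    correctness = 1.0 if answer_correct else 0.0
    cost = num_retrieval_steps / max_steps  # normalized retrieval cost in [0, 1]
    return correctness - lambda_cost * cost


# Example: a correct answer reached in 3 of 8 allowed retrieval hops
print(frugal_reward(answer_correct=True, num_retrieval_steps=3))  # 0.8125
```

Under this kind of objective, a policy that answers correctly in fewer hops earns a strictly higher reward, which is the incentive structure the two-stage finetuning aims to exploit.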
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25237