Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

Vivek Kalyan; Martin Andrews

Published: 06 Oct 2025, Last Modified: 04 Nov 2025

Venue: MTI-LLM @ NeurIPS 2025 Poster

License: CC BY-ND 4.0

Keywords: Multi-Turn, RAG, Reinforcement Learning, Agent

TL;DR: We use RL to train a multi-turn RAG system, and run turn-restricted regimes to understand the value of multiple turns during training and at test time.

Abstract: Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14-billion-parameter model outperforms frontier-class models (85% vs. 78% accuracy). In addition, we explore turn-restricted regimes, both during training and at test time, which show that these agents achieve better results when allowed to operate over longer multi-turn horizons.

Submission Number: 162
