Keywords: LLMs, Search Agents, Reinforcement Learning
Abstract: Enabling Large Language Models (LLMs) to use search tools offers a promising path to overcoming fundamental limitations such as knowledge cutoffs and hallucinations. Recent work has explored reinforcement learning (RL) for training search-augmented agents that interleave reasoning and retrieval before answering. These approaches usually rely on outcome-based rewards (e.g., exact-match accuracy), implicitly assuming that optimizing for final answers will also yield effective intermediate search behaviors. Our analysis challenges this assumption: we uncover multiple systematic deficiencies in search behavior that arise under outcome-only training and ultimately degrade final answer quality, including failure to invoke tools, invalid queries, and redundant searches. To address these shortcomings, we introduce **DeSA** (**De**coupling **S**earch-and-**A**nswering), a simple two-stage training framework that explicitly separates search optimization from answer generation. In Stage 1, agents are trained to improve search effectiveness with a recall-based reward. In Stage 2, outcome rewards are employed to optimize final answer generation. Across a variety of QA benchmarks, **DeSA**-trained agents consistently improve search behaviors, delivering substantially higher search recall and answer accuracy than outcome-only baselines. Notably, DeSA outperforms single-stage training approaches that simultaneously optimize recall and outcome rewards, underscoring the necessity of explicitly decoupling the two objectives.
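The abstract specifies two decoupled reward signals but not their exact form. Below is a minimal Python sketch of what the two stages' rewards could look like, assuming a string-containment notion of search recall in Stage 1 and a binary exact-match outcome score in Stage 2; all function names, the normalization scheme, and the matching criteria are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of DeSA's two decoupled reward signals, based only on the
# abstract: Stage 1 rewards search recall, Stage 2 rewards final-answer outcome.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for lenient string matching (assumed)."""
    return " ".join(text.lower().split())

def stage1_recall_reward(retrieved_docs: list[str], gold_answers: list[str]) -> float:
    """Stage 1 (assumed): fraction of gold answers appearing in the retrieved documents."""
    corpus = normalize(" ".join(retrieved_docs))
    hits = sum(normalize(ans) in corpus for ans in gold_answers)
    return hits / len(gold_answers) if gold_answers else 0.0

def stage2_outcome_reward(predicted_answer: str, gold_answers: list[str]) -> float:
    """Stage 2 (assumed): binary exact-match reward on the final generated answer."""
    pred = normalize(predicted_answer)
    return float(any(pred == normalize(ans) for ans in gold_answers))
```

Under this reading, a single-stage baseline would optimize a weighted sum of the two functions on every rollout, whereas DeSA would train on `stage1_recall_reward` alone before switching to `stage2_outcome_reward`.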
Primary Area: reinforcement learning
Submission Number: 9617