Keywords: agentic search, boundary awareness, Reinforcement Learning
Abstract: RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy when agent policies are optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit ``\textit{I DON'T KNOW}'' (\textit{IDK}) even when evidence is insufficient or reasoning reaches its limit. This lack of boundary awareness often yields plausible yet unfounded answers, posing significant risks in many real-world scenarios. To address this, we propose \underline{\textbf{B}}oundary-\underline{\textbf{A}}ware \underline{\textbf{P}}olicy \underline{\textbf{O}}ptimization \textbf{(BAPO)}, a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an \textit{IDK} response only when reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting \textit{IDK} as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search. Code and data will be released upon acceptance.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: reinforcement learning in agents, safety and alignment for agents, tool use
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8639