Keywords: agentic search, boundary awareness, Reinforcement Learning
Abstract: RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy when agent policies are optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit ``\textit{I DON'T KNOW}'' (\textit{IDK}) even when evidence is insufficient or reasoning reaches its limit. This lack of boundary awareness often yields plausible yet unfounded answers, posing significant risks in many real-world scenarios. To address this, we propose \underline{\textbf{B}}oundary-\underline{\textbf{A}}ware \underline{\textbf{P}}olicy \underline{\textbf{O}}ptimization \textbf{(BAPO)}, a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an \textit{IDK} response only when reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting \textit{IDK} as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search. Code and data will be released upon acceptance.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: reinforcement learning in agents, safety and alignment for agents, tool use
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8639