Keywords: Large Language Model, Deep Search, On-Policy RL
Abstract: Deep search in LLMs hinges on efficiently acquiring external knowledge and up-to-date information to ground reasoning and generation. However, deep search agents often over-trust internal reasoning, terminate prematurely, and under-use external tools, resulting in brittle long-horizon performance. To address this, we introduce LATTE, a mixed-policy reinforcement learning framework that integrates teacher-forced, learner-adaptive reflection to provide targeted guidance that explicitly pushes the model to reflect, extend search rounds when evidence is insufficient, and increase the probability of beneficial tool calls. At each on-policy iteration, we seed reflective trajectories from the current policy’s deep-search rollouts and inject teacher-forced critiques and corrections at decision points that govern whether to continue or stop the search and whether to defer to a tool or proceed with self-reasoning. By conditioning guidance on the learner’s observed behavior and uncertainty, LATTE preserves on-policy updates while narrowing the gap between supervision and policy behavior, yielding an implicit curriculum focused on current failure modes (e.g., premature stopping, missed or delayed tool deferral, shallow exploration). Empirically, LATTE raises calibrated tool-use rates, lengthens effective search depth, and improves both task success and training stability in deep search optimization.
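The abstract describes the training loop in prose only; below is a minimal, self-contained sketch of one possible instantiation, assuming a simplified deep-search environment. The functions `policy_decide` and `teacher_correction`, the uncertainty threshold, and the reward are illustrative placeholders, not the authors' implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical decision types at which teacher-forced guidance could be injected.
CONTINUE, STOP, CALL_TOOL, SELF_REASON = "continue", "stop", "call_tool", "self_reason"

@dataclass
class Step:
    action: str           # decision taken at this point
    logprob: float         # log-probability under the current policy
    teacher_forced: bool    # whether a teacher critique/correction was injected

def policy_decide(state):
    """Placeholder for the current policy: returns (action, logprob, uncertainty)."""
    action = random.choice([CONTINUE, STOP, CALL_TOOL, SELF_REASON])
    return action, random.uniform(-2.0, -0.1), random.random()

def teacher_correction(state, action):
    """Placeholder teacher: overrides premature stops and missed tool calls."""
    if action == STOP and state["evidence"] < state["needed_evidence"]:
        return CONTINUE   # extend the search round when evidence is insufficient
    if action == SELF_REASON and state["tool_would_help"]:
        return CALL_TOOL  # defer to a tool instead of self-reasoning
    return None           # no correction needed

def rollout(max_steps=8, uncertainty_threshold=0.7):
    """Seed one reflective trajectory from an on-policy deep-search rollout."""
    state = {"evidence": 0, "needed_evidence": 3, "tool_would_help": True}
    traj = []
    for _ in range(max_steps):
        action, logprob, uncertainty = policy_decide(state)
        forced = False
        # Condition guidance on observed behavior and uncertainty: only inject
        # teacher-forced corrections at uncertain decision points.
        if uncertainty > uncertainty_threshold:
            corrected = teacher_correction(state, action)
            if corrected is not None:
                action, forced = corrected, True
        traj.append(Step(action, logprob, forced))
        if action == CALL_TOOL:
            state["evidence"] += 1
            state["tool_would_help"] = state["evidence"] < state["needed_evidence"]
        if action == STOP:
            break
    return traj, state

def policy_gradient_loss(traj, reward):
    """REINFORCE-style loss over on-policy steps only; teacher-forced steps
    shape the trajectory but are excluded from the gradient (one way to keep
    updates on-policy while narrowing the supervision-behavior gap)."""
    on_policy = [s.logprob for s in traj if not s.teacher_forced]
    return -reward * sum(on_policy) / max(len(on_policy), 1)

if __name__ == "__main__":
    traj, final_state = rollout()
    reward = 1.0 if final_state["evidence"] >= final_state["needed_evidence"] else 0.0
    print(f"steps={len(traj)} reward={reward} loss={policy_gradient_loss(traj, reward):.3f}")
```

In this reading, the implicit curriculum arises because corrections fire only where the current policy actually errs (premature stops, missed tool deferrals), so supervision tracks the learner's present failure modes rather than a fixed expert distribution.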
Primary Area: reinforcement learning
Submission Number: 23175