Keywords: deepresearch, deepsearch, reasoning
Abstract: We present Fathom-Search-4B, a tool-using LLM specialized in evidence-based DeepSearch. Our approach combines three advances. First, DUETQA, a ~5K-sample training dataset generated via our novel multi-agent self-play framework, which synthesizes question–answer pairs with strict live-web-search dependence, a bias toward facts published after the model's knowledge cutoff, and heterogeneous web sources beyond Wikipedia. Second, we introduce RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards (RLVR) via three upgrades: (i) curriculum-inspired pruning of saturated prompts; (ii) reward-aware advantage scaling that preserves gradient magnitude under sparse rewards; and (iii) a per-prompt replay buffer that injects the latest successful rollout into failed groups, restoring reward variance and stabilizing relative-advantage estimates. Third, we design a steerable step-level reward that classifies each tool call by cognitive behaviour and marginal utility (e.g., exploration, verification, redundancy), enabling explicit control over search breadth, cross-source verification depth, and overall tool-use horizon; this reliably extends effective trajectories beyond 20 tool calls when warranted. The agent operates with a goal-conditioned web-search stack (live web search via a search engine plus targeted web-page querying via an LLM). Evaluated on DeepSearch benchmarks (e.g., SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and out-of-domain reasoning suites (HLE, AIME-25, GPQA-Diamond, MedQA), Fathom-Search-4B attains state-of-the-art results in the open-weights category across all DeepSearch benchmarks and achieves significant improvements on general reasoning tasks via tool-integrated reasoning.
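To make the DUETQA construction concrete, here is a minimal sketch of the self-play filtering loop the abstract implies. The two-agent split (a question setter plus closed-book and web-enabled solvers), the names, and the acceptance rules are our own assumptions read off the abstract, not the paper's actual framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical sketch of DUETQA-style self-play filtering. The protocol
# below is one plausible reading of "strict live-web-search dependence";
# the paper's real multi-agent framework may differ.

@dataclass
class Duet:
    compose_qa: Callable[[str], Tuple[str, str]]  # setter: url -> (question, answer)
    closed_book: Callable[[str], str]             # solver without web access
    web_enabled: Callable[[str], str]             # solver with live web search

def synthesize_pair(duet: Duet, seed_url: str) -> Optional[Tuple[str, str]]:
    """Keep a QA pair only if it genuinely requires live web search."""
    # Setter grounds the pair in a (non-Wikipedia, post-cutoff) web page.
    question, answer = duet.compose_qa(seed_url)
    if duet.closed_book(question) == answer:
        return None  # answerable from parametric memory alone; discard
    if duet.web_enabled(question) != answer:
        return None  # not reliably recoverable from the live web; discard
    return question, answer
```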
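Similarly, the following sketch shows how RAPO's three upgrades could plug into a GRPO-style group-advantage computation. The binary-reward setting, the MIN_STD clamp, and the buffer layout are illustrative assumptions; only the three mechanisms (pruning, reward-aware scaling, replay injection) come from the abstract.

```python
import numpy as np

MIN_STD = 0.5        # hypothetical floor on the advantage normalizer
replay_buffer = {}   # prompt_id -> reward of the latest successful rollout

def rapo_advantages(prompt_id, rewards):
    """Group advantages for one prompt's rollouts (binary rewards assumed)."""
    rewards = np.asarray(rewards, dtype=np.float64)

    # (i) Curriculum-inspired pruning: a saturated prompt (every rollout
    # already succeeds) carries no learning signal, so drop it.
    if rewards.min() == 1.0:
        return None

    # (iii) Per-prompt replay buffer: if the whole group failed, inject the
    # latest successful rollout (its tokens would be replayed too; only the
    # reward bookkeeping is shown) to restore reward variance.
    if rewards.max() == 0.0 and prompt_id in replay_buffer:
        rewards = np.append(rewards, replay_buffer[prompt_id])
    elif rewards.max() == 1.0:
        replay_buffer[prompt_id] = 1.0

    # (ii) Reward-aware advantage scaling, one plausible form: plain GRPO
    # divides by the group std, which shrinks toward zero under sparse
    # rewards and destabilizes updates; clamping the normalizer keeps
    # gradient magnitudes comparable across groups.
    centered = rewards - rewards.mean()
    return centered / max(rewards.std(), MIN_STD)
```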
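Finally, a hedged sketch of the step-level reward that classifies tool calls. The taxonomy matches the abstract's examples, but the set-membership classifier, the ToolCall fields, and the steering weights are placeholders; a real system would presumably use a learned or LLM-based judge.

```python
from dataclasses import dataclass
from enum import Enum

@dataclass(frozen=True)
class ToolCall:
    url: str    # page or query targeted by this tool call
    claim: str  # the claim the call tries to establish or check

class StepType(Enum):
    EXPLORATION = "exploration"    # opens a new source or line of inquiry
    VERIFICATION = "verification"  # cross-checks an already-found claim
    REDUNDANCY = "redundancy"      # repeats earlier work; no marginal utility

# Hypothetical steering weights: raising VERIFICATION deepens cross-source
# checking, raising EXPLORATION widens search breadth, and the REDUNDANCY
# penalty bounds the overall tool-use horizon.
STEP_WEIGHTS = {
    StepType.EXPLORATION: 0.5,
    StepType.VERIFICATION: 0.3,
    StepType.REDUNDANCY: -0.2,
}

def step_reward(call: ToolCall, seen_urls: set, verified_claims: set) -> float:
    """Classify one tool call by marginal utility and return its reward."""
    if call.url not in seen_urls:
        kind = StepType.EXPLORATION
    elif call.claim not in verified_claims:
        kind = StepType.VERIFICATION
    else:
        kind = StepType.REDUNDANCY
    return STEP_WEIGHTS[kind]
```

Tuning these weights is what makes the reward "steerable": the ratio of exploration to verification reward sets how wide versus how deep the agent searches, and the redundancy penalty controls when long trajectories (beyond 20 tool calls) remain worthwhile.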
Submission Number: 142