Keywords: reasoning, deepsearch, deepresearch
Abstract: We present Fathom-Search-4B, a 4B-parameter, tool-using LLM trained to perform evidence-based DeepSearch over heterogeneous sources (HTML, PDFs, blogs). Our approach combines three advances. First, DUETQA, a ~5K-example dataset generated via multi-agent self-play, enforces live-web dependence, post-2024 recency, and source diversity beyond Wikipedia. Second, we introduce RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn RL via prompt-level pruning of saturated items, batch-aware advantage scaling that preserves gradient magnitude, and a per-prompt replay buffer that restores reward variance. Third, we design a steerable step-level reward that labels each tool call as exploration, verification, or redundant, allowing explicit control over search breadth, cross-source verification depth, and overall tool-use horizon; this reliably extends effective trajectories to 10+ tool calls when warranted. The agent operates with a goal-conditioned retrieval stack (search selection plus targeted page querying), improving the signal-to-noise ratio over snippet-only or greedy retrieval. Evaluated on DeepSearch benchmarks (e.g., SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and out-of-domain reasoning suites (HLE, AIME-25, GPQA-Diamond, MedQA), Fathom-Search-4B attains state-of-the-art results among open models, with large gains on retrieval-heavy tasks and strong transfer to STEM and medical QA.
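To make the RAPO mechanics concrete, here is a minimal Python sketch of the three mechanisms the abstract names: pruning saturated prompts, batch-aware advantage scaling, and a per-prompt replay buffer. It assumes binary outcome rewards and GRPO-style group-normalized advantages; the names (`RAPOBatch`, the saturation test, the injection policy) are hypothetical illustrations of those mechanisms, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict


def grpo_advantages(rewards):
    """Standard GRPO advantage: reward normalized within the rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)


class RAPOBatch:
    """Hypothetical sketch of RAPO batch construction on top of GRPO."""

    def __init__(self):
        # Per-prompt buffer of previously successful rollouts.
        self.replay = defaultdict(list)

    def build(self, batch):
        # batch: list of (prompt_id, rollouts, rewards) triples.
        kept = []
        for pid, rollouts, rewards in batch:
            if len(set(rewards)) == 1:  # saturated group: zero reward variance
                if max(rewards) == 0 and self.replay[pid]:
                    # All-failure group: swap in a stored success so the
                    # group regains within-group reward variance (replay).
                    rollouts = rollouts[:-1] + [self.replay[pid][-1]]
                    rewards = rewards[:-1] + [1.0]
                else:
                    # Solved (all-success) or unrecoverable prompt: prune it.
                    continue
            kept.append((pid, rollouts, rewards))

        # Batch-aware advantage scaling: after pruning, rescale advantages
        # so the summed gradient magnitude matches the pre-pruning batch.
        scale = len(batch) / max(len(kept), 1)
        out = []
        for pid, rollouts, rewards in kept:
            adv = grpo_advantages(rewards) * scale
            out.append((pid, rollouts, adv))
            for roll, r in zip(rollouts, rewards):
                if r > 0:
                    self.replay[pid].append(roll)  # cache successes
        return out
```

The intuition under these assumptions: a group whose rollouts all receive the same reward yields zero within-group advantage and thus no gradient signal, so it is either pruned (solved) or has its variance restored from replay (all-failure), while the survivors' advantages are rescaled so the batch-level gradient magnitude is preserved.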
Submission Number: 235