Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents

ICLR 2026 Conference Submission 25205 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Epistemic Competence, Evidence-Grounded Reasoning, LLM Search Agents
Abstract: Recent work has explored training Large Language Model (LLM) search agents with reinforcement learning (RL) for open-domain question answering (QA). However, most evaluations focus solely on final answer accuracy, overlooking how these agents reason with and act on external evidence. We introduce **SeekBench**, the first benchmark for evaluating the *epistemic competence* of LLM search agents through step-level analysis of their response traces. **SeekBench** comprises 190 expert-annotated traces with over 1,800 response steps generated by LLM search agents, each enriched with evidence annotations for granular analysis of whether agents (1) generate reasoning steps grounded in observed evidence, (2) adaptively reformulate searches to recover from low-quality results, and (3) are properly calibrated to assess whether the current evidence suffices to provide an answer. Our analysis of state-of-the-art LLM search agents reveals critical behavioral gaps overlooked by traditional metrics, as well as specialized strengths such as Search-R1's synthesis capabilities. These findings expose distinct epistemic competencies that accuracy-only evaluations fail to capture, providing guidance for developing more capable and reliable agents.
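
To make the three step-level checks concrete, below is a minimal Python sketch of how an annotated trace and the corresponding grounding, recovery, and calibration measurements *might* be represented. All field and function names (e.g. `Step`, `evidence`, `grounded`, `recovery_rate`) are illustrative assumptions, not the benchmark's released schema.

```python
# Hypothetical sketch of a SeekBench-style annotated trace and the three
# step-level checks described in the abstract. Field names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    kind: str                                   # "reason", "search", or "answer"
    text: str                                   # the agent's output at this step
    evidence: List[str] = field(default_factory=list)  # passages observed so far
    grounded: Optional[bool] = None             # annotator: reasoning supported by evidence?
    low_quality_results: bool = False           # annotator: did the prior search return poor results?
    reformulated: bool = False                  # annotator: did the agent rewrite its query in response?
    evidence_sufficient: Optional[bool] = None  # annotator: was the evidence enough to answer?


@dataclass
class Trace:
    question: str
    steps: List[Step]
    answered: bool                              # did the agent commit to a final answer?


def grounding_rate(trace: Trace) -> float:
    """Fraction of reasoning steps judged grounded in observed evidence."""
    judged = [s for s in trace.steps if s.kind == "reason" and s.grounded is not None]
    return sum(s.grounded for s in judged) / len(judged) if judged else float("nan")


def recovery_rate(trace: Trace) -> float:
    """Fraction of low-quality search results followed by a reformulated query."""
    chances = [s for s in trace.steps if s.low_quality_results]
    return sum(s.reformulated for s in chances) / len(chances) if chances else float("nan")


def calibration_ok(trace: Trace) -> bool:
    """Did the agent answer exactly when the annotated evidence was sufficient?"""
    final = trace.steps[-1]
    return trace.answered == bool(final.evidence_sufficient)
```

Under these assumptions, each metric is computed per trace and then aggregated over the 190 annotated traces; the actual benchmark may define the aggregation differently.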
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 25205