Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents
Keywords: Epistemic Competence, Evidence-Grounded Reasoning, LLM Search Agents
Abstract: Recent work has explored training Large Language Model (LLM) search agents with reinforcement learning (RL) for open-domain question answering. However, most evaluations focus solely on final answer accuracy, overlooking how these agents reason with and act on external evidence.
We introduce **SeekBench**, the first process-level evaluation framework for LLM search agents, which operationalizes *epistemic competence* through metrics derived from an annotation schema.
We develop and validate our annotation schema using an expert-annotated dataset of 190 traces (over 1,800 steps).
To enable evaluation at scale, we introduce an LLM-as-judge pipeline.
Our framework provides granular analysis of whether agents demonstrate: (1) **groundedness**, by generating reasoning steps supported by observed evidence; (2) **recovery**, by adaptively reformulating searches to recover from low-quality results; and (3) **calibration**, by correctly assessing whether current evidence is sufficient to provide an answer.
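To make the three metrics concrete, below is a minimal sketch of how step-level annotations might be aggregated into trace-level scores. The `Step` and `Trace` fields and the aggregation rules are hypothetical assumptions for illustration, not SeekBench's actual annotation schema or scoring formulas.

```python
# Hypothetical sketch: aggregating step-level annotations into trace-level
# groundedness, recovery, and calibration scores. Field names and aggregation
# rules are illustrative assumptions, not the paper's actual definitions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    grounded: Optional[bool]          # reasoning step supported by observed evidence?
    low_quality_result: bool = False  # did the search return low-quality evidence?
    reformulated_next: bool = False   # did the agent reformulate its next query?

@dataclass
class Trace:
    steps: List[Step]
    claimed_sufficient: bool  # agent judged its evidence sufficient to answer
    answer_correct: bool      # final answer matched the gold answer

def groundedness(trace: Trace) -> float:
    """Fraction of annotated reasoning steps supported by observed evidence."""
    annotated = [s for s in trace.steps if s.grounded is not None]
    return sum(s.grounded for s in annotated) / len(annotated) if annotated else 0.0

def recovery(trace: Trace) -> float:
    """Fraction of low-quality search results followed by a reformulated query."""
    failures = [s for s in trace.steps if s.low_quality_result]
    return sum(s.reformulated_next for s in failures) / len(failures) if failures else 1.0

def calibrated(trace: Trace) -> bool:
    """Sufficiency judgment counts as calibrated if it matches answer correctness."""
    return trace.claimed_sufficient == trace.answer_correct

# Example: a short trace with one recovered search failure and a correct answer.
trace = Trace(
    steps=[
        Step(grounded=True, low_quality_result=True, reformulated_next=True),
        Step(grounded=True),
    ],
    claimed_sufficient=True,
    answer_correct=True,
)
print(groundedness(trace), recovery(trace), calibrated(trace))  # 1.0 1.0 True
```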
By applying our evaluation framework to state-of-the-art search agents built on Qwen2.5-7B, we uncover critical behavioral gaps that answer-only metrics miss, as well as specialized strengths such as Search-R1's evidence-synthesis ability. These analyses highlight distinct epistemic competencies and offer actionable insights for developing more capable and trustworthy agents.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 25205