Keywords: Contamination, Fact-Checking, Information Retrieval, Benchmark
Abstract: Evaluating information retrieval in agentic systems is increasingly difficult due to model contamination and the tight coupling between retrieval and interleaved agent reasoning. Large language models may recall fact-checking knowledge from pretraining, while agents shape queries in ways that confound retrieval evaluation, causing standard end-to-end evaluations to yield conclusions that do not generalize across agentic architectures or datasets. We introduce a contamination-aware evaluation framework for retrieval in agentic fact-checking that fixes the language model and corpus and evaluates retrieval across diverse agent-retriever interaction settings, enabling controlled analysis of how contamination and query generation affect retrieval quality independently of downstream reasoning. Our experiments show that contamination impacts retrieval behavior, that retriever rankings are unstable across agentic systems due to query-retrieval interaction effects, and that different choices of how nDCG values are aggregated can lead to qualitatively different and even reversed comparisons between agents. For datasets with silver documents, we propose nDEv2R, a rank-sensitive, fact-level retrieval metric that remains informative under incomplete evidence supervision. While instantiated in fact-checking, our findings apply more broadly to evaluating retrieval components embedded in agentic systems such as question answering and multi-document reasoning.
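To make the aggregation claim concrete, here is a minimal Python sketch of how pooling per-query nDCG scores across all queries (micro-averaging) versus averaging within each claim first (macro-averaging) can reverse which agent looks better. All agent names, claims, and scores below are hypothetical illustrations, not values from the paper.

```python
# Toy numbers, invented purely for illustration; nothing here reproduces the
# paper's agents, datasets, or reported scores. Each agent issues a different
# number of retrieval queries per claim, so pooling all queries (micro) and
# averaging within each claim first (macro) weight claims differently.
per_query_ndcg = {
    "agent_A": {"claim_1": [0.90], "claim_2": [0.20, 0.30, 0.25]},
    "agent_B": {"claim_1": [0.60], "claim_2": [0.50, 0.50, 0.50]},
}

for agent, claims in per_query_ndcg.items():
    pooled = [s for scores in claims.values() for s in scores]
    micro = sum(pooled) / len(pooled)  # query-level average over the pool
    macro = sum(sum(s) / len(s) for s in claims.values()) / len(claims)  # claim-level
    print(f"{agent}: micro={micro:.3f}, macro={macro:.3f}")

# With these numbers, micro ranks agent_B above agent_A (0.525 vs 0.413),
# while macro reverses the order (agent_A 0.575 vs agent_B 0.550).
```

The reversal arises because an agent that decomposes a claim into many sub-queries contributes more terms to the micro average, so its performance on multi-query claims dominates under one aggregation scheme but not the other.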
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Contamination, Fact-Checking, Information Retrieval, Benchmark
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1344