Keywords: Large Language Models (LLMs), Benchmarks, Retrieval-Augmented Generation (RAG), Search agent
Abstract: Static benchmarks often conflate memorization with reasoning, failing to capture the dynamic nature of world knowledge.
We present \textsc{LiveSearchBench}, an automated pipeline that constructs retrieval-dependent benchmarks from knowledge-graph differentials.
Unlike prior dynamic evaluations focused on simple fact updates, our method synthesizes complex, multi-constraint questions guaranteed to have unique answers via strict SPARQL validation.
Experiments reveal a pronounced ``Recency Gap'': models struggle significantly with facts post-dating their pretraining, particularly on multi-hop queries.
While retrieval-augmented generation (RAG) offers partial gains, it fails to close this gap, limited by distinct failures in both indexing novel entities and reasoning over evidence.
\textsc{LiveSearchBench} thus shifts reasoning evaluation from static memorization toward rigorous, real-time evidence integration under evolving knowledge.
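For illustration only, the following is a minimal sketch of the unique-answer check mentioned in the abstract: a SPARQL query is admitted as a benchmark item only if it binds exactly one result. The endpoint, helper name, and example query are assumptions for this sketch, not the authors' actual pipeline.

    # Minimal sketch (not the authors' pipeline): accept a multi-constraint
    # SPARQL query as a benchmark item only if it yields exactly one answer.
    # Assumes Wikidata's public SPARQL endpoint; the query below is illustrative.
    import requests

    WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

    def has_unique_answer(query: str) -> bool:
        """Return True iff the SPARQL query binds exactly one result row."""
        resp = requests.get(
            WIKIDATA_ENDPOINT,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "livesearchbench-sketch/0.1"},
            timeout=60,
        )
        resp.raise_for_status()
        rows = resp.json()["results"]["bindings"]
        return len(rows) == 1

    if __name__ == "__main__":
        # Illustrative multi-constraint query (entities whose capital is Paris
        # and that are sovereign states); it may not be unique in practice.
        example = """
        SELECT ?country WHERE {
          ?country wdt:P36 wd:Q90 .        # capital is Paris (Q90)
          ?country wdt:P31 wd:Q3624078 .   # instance of sovereign state
        }
        """
        print(has_unique_answer(example))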
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: search agent, RAG, NLP, benchmark, reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 7803