Evaluating Reasoning over Novel Facts Beyond Parametric Memory

ACL ARR 2026 January Submission 7803 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Large Language Models (LLMs), Benchmarks, Retrieval-Augmented Generation (RAG), Search agent
Abstract: Static benchmarks often conflate memorization with reasoning, failing to capture the dynamic nature of world knowledge. We present \textsc{LiveSearchBench}, an automated pipeline constructing retrieval-dependent benchmarks from knowledge graph differentials. Unlike prior dynamic evaluations focused on simple fact updates, our method synthesizes complex, multi-constraint questions guaranteed to have unique answers via strict SPARQL validation. Experiments reveal a pronounced ``Recency Gap'': models struggle significantly with facts post-dating their pretraining, particularly on multi-hop queries. While retrieval-augmented generation (RAG) offers partial gains, it fails to close this gap, limited by distinct failures in both indexing novel entities and reasoning over evidence. \textsc{LiveSearchBench} thus shifts reasoning evaluation from static memorization toward rigorous, real-time evidence integration under evolving knowledge.
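The abstract's core filtering step — accepting a multi-constraint question only when strict SPARQL validation proves it has exactly one answer — can be illustrated with a toy stand-in. The sketch below uses a hypothetical in-memory triple set and plain Python set intersection in place of a real SPARQL endpoint; the entity and predicate names are invented for illustration and are not from the paper.

```python
# Toy illustration of a uniqueness check over a knowledge graph.
# The real pipeline issues conjunctive SPARQL queries against a live
# KG; here a miniature (subject, predicate, object) triple set and
# set intersection stand in for that endpoint (all names hypothetical).

TRIPLES = {
    ("alice", "wonAward", "prizeX"),
    ("bob",   "wonAward", "prizeX"),
    ("alice", "bornIn",   "paris"),
    ("bob",   "bornIn",   "london"),
}

def candidates(constraints):
    """Return subjects satisfying every (predicate, object) constraint,
    mimicking a conjunctive SPARQL SELECT over the graph."""
    subjects = {s for s, _, _ in TRIPLES}
    for pred, obj in constraints:
        subjects &= {s for s, p, o in TRIPLES if p == pred and o == obj}
    return subjects

def has_unique_answer(constraints):
    """A candidate question is kept only if exactly one entity matches."""
    return len(candidates(constraints)) == 1

# One constraint matches two entities -> question rejected as ambiguous.
print(has_unique_answer([("wonAward", "prizeX")]))                       # False
# A second constraint narrows the match to one entity -> question kept.
print(has_unique_answer([("wonAward", "prizeX"), ("bornIn", "paris")]))  # True
```

Stacking constraints until the answer set collapses to a singleton is what lets the benchmark guarantee a unique gold answer for each synthesized multi-hop question.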
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: search agent, RAG, NLP, benchmark, reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 7803