Keywords: Large Language Models (LLMs), Benchmarks, Retrieval-Augmented Generation (RAG), Search agent
Abstract: Static benchmarks often conflate memorization with reasoning, failing to capture the dynamic nature of world knowledge.
We present \textsc{LiveSearchBench}, an automated pipeline that constructs retrieval-dependent benchmarks from knowledge-graph differentials.
Unlike prior dynamic evaluations focused on simple fact updates, our method synthesizes complex, multi-constraint questions guaranteed to have unique answers via strict SPARQL validation.
Experiments reveal a pronounced ``Recency Gap'': models struggle significantly with facts post-dating their pretraining, particularly on multi-hop queries.
While retrieval-augmented generation (RAG) offers partial gains, it fails to close this gap, limited by distinct failures in both indexing novel entities and reasoning over evidence.
\textsc{LiveSearchBench} thus shifts reasoning evaluation from static memorization toward rigorous, real-time evidence integration under evolving knowledge.
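For illustration only, the following is a minimal sketch of the unique-answer check mentioned in the abstract: a SPARQL query is admitted as a benchmark item only if it binds exactly one result. The endpoint, helper name, and example query are assumptions for this sketch, not the authors' actual pipeline.

    # Minimal sketch (not the authors' pipeline): accept a multi-constraint
    # SPARQL query as a benchmark item only if it yields exactly one answer.
    # Assumes Wikidata's public SPARQL endpoint; the query below is illustrative.
    import requests

    WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

    def has_unique_answer(query: str) -> bool:
        """Return True iff the SPARQL query binds exactly one result row."""
        resp = requests.get(
            WIKIDATA_ENDPOINT,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "livesearchbench-sketch/0.1"},
            timeout=60,
        )
        resp.raise_for_status()
        rows = resp.json()["results"]["bindings"]
        return len(rows) == 1

    if __name__ == "__main__":
        # Illustrative multi-constraint query (entities whose capital is Paris
        # and that are sovereign states); it may not be unique in practice.
        example = """
        SELECT ?country WHERE {
          ?country wdt:P36 wd:Q90 .        # capital is Paris (Q90)
          ?country wdt:P31 wd:Q3624078 .   # instance of sovereign state
        }
        """
        print(has_unique_answer(example))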
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: search agent, RAG, NLP, benchmark, reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 7803