Keywords: LLM benchmarking, dynamic evaluation, knowledge updates, automated benchmarks, retrieval-augmented methods
Abstract: Knowledge memorization is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements.
To address this, we propose a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain, where knowledge updates daily, we design an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks while maintaining both quality and efficiency.
Our approach democratizes benchmark creation and facilitates robust evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate a range of LLMs, both open-source and proprietary, across various sizes and configurations—with and without retrieval—on freshly generated knowledge. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models.
These findings underscore the importance of evaluating LLMs on evolving benchmarks to more accurately estimate their knowledge capabilities and guide future advancements.
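To make the four pipeline stages named in the abstract (sourcing, creation, validation, distribution) concrete, here is a minimal illustrative sketch; it is not the authors' implementation. The data classes, function names, and placeholder heuristics are hypothetical, and in a real system each stage would be backed by an LLM agent and a live news feed rather than the stubs shown.

```python
# Illustrative sketch only: a four-stage pipeline (source -> create -> validate -> distribute)
# mirroring the stages named in the abstract. All names and heuristics are hypothetical;
# a real system would back each stage with an LLM agent and a news API or crawler.
from dataclasses import dataclass
from datetime import date
import json


@dataclass
class NewsItem:          # hypothetical container for a sourced article
    title: str
    body: str
    published: date


@dataclass
class QAItem:            # hypothetical container for a generated benchmark question
    question: str
    answer: str
    source_title: str


def source_news(day: date) -> list[NewsItem]:
    """Stage 1 (sourcing): fetch the day's articles. Stubbed with fixed data here."""
    return [NewsItem("Example headline", "Example article body with a key fact.", day)]


def create_questions(items: list[NewsItem]) -> list[QAItem]:
    """Stage 2 (creation): turn each article into QA pairs. Stubbed heuristic;
    an agent would prompt an LLM to extract facts and write questions."""
    return [
        QAItem(f"What does the article '{it.title}' report?", it.body, it.title)
        for it in items
    ]


def validate_questions(qas: list[QAItem]) -> list[QAItem]:
    """Stage 3 (validation): drop malformed or empty items. Stubbed check;
    an agent would verify answerability against the source text."""
    return [qa for qa in qas if qa.question.endswith("?") and qa.answer.strip()]


def distribute(qas: list[QAItem], path: str) -> None:
    """Stage 4 (distribution): publish the benchmark, e.g. as a JSONL file."""
    with open(path, "w", encoding="utf-8") as f:
        for qa in qas:
            f.write(json.dumps(qa.__dict__, default=str) + "\n")


if __name__ == "__main__":
    today = date.today()
    benchmark = validate_questions(create_questions(source_news(today)))
    distribute(benchmark, f"news_benchmark_{today.isoformat()}.jsonl")
```

Running the script end to end would emit a dated JSONL benchmark file, which illustrates the on-demand, daily-refresh property the abstract emphasizes.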
Archival Status: Non-archival (not included in proceedings)
Submission Number: 50