LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News

ICLR 2026 Conference Submission 24415 Authors

20 Sept 2025 (modified: 23 Dec 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Dataset, Benchmarks, Evaluation, LLM, Web Search, LLM Agents
Abstract: Large Language Models (LLMs) augmented with web search capabilities demonstrate strong potential on tasks requiring real-time knowledge access or retrieval of obscure facts. However, evaluating such systems remains challenging. Existing benchmarks such as SimpleQA, BrowseComp, FreshQA, and SealQA typically rely on fixed question sets, making it difficult to disentangle genuine search ability from memorized world knowledge and raising concerns about benchmark overfitting. Manual curation also limits these benchmarks to test-only settings, leading to a lack of open training data. To address these limitations, we introduce LiveNewsBench, a scalable, regularly updated, and challenging benchmark designed to rigorously assess the web search capabilities of LLMs. LiveNewsBench automatically generates fresh question-answer pairs from recent news articles, ensuring that solving the benchmark requires information beyond an LLM’s training data and thereby enabling a clear distinction between a model’s internal knowledge and its search skills. Our automated and scalable data pipeline supports construction of training, validation, and test sets, addressing the lack of open data for training web-search-enabled LLMs. The benchmark questions are deliberately challenging, requiring multiple search queries, page visits, and reasoning steps, which makes them suitable for assessing the agentic search abilities of LLMs. To ensure reliable evaluation results, we include a subset of human-verified samples in the test set. We commit to updating LiveNewsBench quarterly over the next two years to maintain its recency. We use LiveNewsBench to evaluate a diverse suite of systems, including commercial, open-weight, and local LLMs, as well as LLM-based web search APIs.
Primary Area: datasets and benchmarks
Submission Number: 24415