Abstract: Generating comprehensive and accurate Wikipedia articles for newly emerging real-world events presents significant challenges. Previous efforts have often fallen short by focusing only on short snippets, neglecting verifiability, or ignoring the impact of the pre-training corpus. In this paper, we simulate a real-world scenario in which structured, full-length Wikipedia articles with citations are generated for new events from input documents gathered from web sources. To minimize data leakage in Large Language Models (LLMs), we select recent events and construct a new benchmark, WIKIGENBENCH, consisting of 1,320 events paired with their related web documents. We also design a systematic set of evaluation metrics and LLM-based baseline methods to assess the capability of LLMs to generate factual, full-length Wikipedia articles. The data and code will be released upon acceptance.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; NLP datasets; retrieval-augmented generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5329