LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

ACL ARR 2025 May Submission6739 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Question Answering (QA) on narrative text poses a unique challenge for current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics, revealing that all n-gram-based metrics have low system-level correlation with human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, agree strongly with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://omitted.link.
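To make the meta-evaluation concrete, below is a minimal sketch of a system-level correlation computation of the kind the abstract describes: each metric's per-sample scores are averaged per QA system, and the resulting system-level scores are correlated against human judgments. All names and numbers here are hypothetical illustrations, not the paper's actual protocol or results.

```python
# Minimal sketch of system-level meta-evaluation: correlate per-system
# metric averages with per-system human judgments. Data is illustrative.
from scipy.stats import spearmanr, kendalltau

# Hypothetical mean scores, one value per QA system (same ordering).
human_scores = [0.71, 0.64, 0.58, 0.49]   # human judgment averages
metric_scores = [0.33, 0.35, 0.29, 0.22]  # e.g., an n-gram metric's averages

# Rank correlations: values near 1 mean the metric reproduces the
# human ranking of systems; values near 0 mean it does not.
rho, _ = spearmanr(human_scores, metric_scores)
tau, _ = kendalltau(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```

Under this setup, a metric that swaps the order of systems relative to the human ranking (as metric_scores does for the first two systems above) yields a lower rank correlation, which is the kind of disagreement the abstract reports for n-gram-based metrics.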
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic creation and evaluation of language resources, evaluation methodologies, benchmarking, metrics
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 6739