Literary Evidence Retrieval via Long-Context Language Models

ACL ARR 2025 February Submission 7237 Authors

16 Feb 2025 (modified: 09 May 2025)
License: CC BY 4.0
Abstract: How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a work (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. To ensure reliable evaluation, we curate a high-quality subset of 252 examples through extensive filtering and human verification. Our experiments show that large long-context models, such as Gemini 1.5 Pro and GPT-4o, retrieve literary evidence more effectively than state-of-the-art retrievers, yet still trail human experts (40% vs. 52.5% accuracy). Moreover, smaller open-weight models, such as LLaMA 3.1 and Qwen2.5, achieve only 2–5% accuracy, suggesting that they lack the reasoning abilities essential for literary interpretation. We release our dataset to facilitate future applications of LLMs to literary analysis.
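For readers who want a concrete picture of the task, below is a minimal sketch of the long-context evaluation setup the abstract describes, written against the OpenAI Python client. The [MASK] placeholder convention, the prompt wording, and the function name retrieve_missing_quotation are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch: place the full novel and a criticism excerpt with a
# masked quotation into one long-context prompt, then ask the model to
# reproduce the missing quotation verbatim.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def retrieve_missing_quotation(book_text: str, criticism_with_mask: str) -> str:
    """Ask a long-context model to fill in the masked quotation.

    book_text: the entire primary source (e.g., The Great Gatsby).
    criticism_with_mask: a literary-criticism passage in which the
        quotation has been replaced by [MASK] (placeholder convention
        is an assumption, not necessarily the paper's format).
    """
    prompt = (
        "Below is the full text of a novel, followed by an excerpt of "
        "literary criticism in which a direct quotation from the novel "
        "has been replaced with [MASK]. Output only the exact quotation "
        "from the novel that belongs in place of [MASK].\n\n"
        f"NOVEL:\n{book_text}\n\n"
        f"CRITICISM:\n{criticism_with_mask}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding for evaluation
    )
    return response.choices[0].message.content.strip()
```

A generated quotation would then be scored against the held-out ground-truth passage (e.g., by exact or fuzzy string match against the novel's text); the scoring details here are assumptions rather than the paper's stated metric.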
Paper Type: Short
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: literary analysis, English literature, literary scholarship
Contribution Types: Data analysis
Languages Studied: English
Submission Number: 7237