The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works

ACL ARR 2025 May Submission6500 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive with existing models and scales effectively to long documents. Finally, we demonstrate its usefulness for inferring the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.
Paper Type: Long
Research Area: Discourse and Pragmatics
Research Area Keywords: coreference resolution, anaphora resolution, corpus creation, NLP datasets, named entity recognition and relation extraction, entity linking/disambiguation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: French
Submission Number: 6500