Abstract: While coreference resolution is attracting more interest than ever from researchers in computational literary studies, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long-distance reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive with state-of-the-art models and scales effectively to long documents. Finally, we demonstrate its usefulness for inferring the gender of fictional characters, showcasing its relevance for both literary analysis and downstream natural language processing tasks.
Paper Type: Long
Research Area: Discourse and Pragmatics
Research Area Keywords: coreference resolution, anaphora resolution, corpus creation, NLP datasets, named entity recognition and relation extraction, entity linking/disambiguation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: French
Submission Number: 1362