NovelCR: A Large-Scale Bilingual Dataset Tailored for Long-Span Coreference Resolution

ACL ARR 2025 February Submission6300 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Coreference resolution (CR) links mentions such as pronouns and noun phrases to the entities they refer to, and is an important step toward deep text understanding. Existing CR datasets are either small in scale or restrict coreference resolution to a limited text span. In this paper, we present NovelCR, a large-scale bilingual benchmark designed for long-span coreference resolution. NovelCR features extensive annotations, including 148k mentions in NovelCR-en and 311k mentions in NovelCR-zh. Moreover, the dataset is notably rich in long-span coreference pairs, with 85% of pairs in NovelCR-en and 83% in NovelCR-zh spanning three or more sentences. Experiments on NovelCR reveal a large gap between state-of-the-art baselines and human performance, indicating that long-span coreference resolution on NovelCR remains an open problem.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Long-span coreference resolution; Bilingual
Contribution Types: Data resources
Languages Studied: English; Chinese
Submission Number: 6300