NovelCR: A Large-Scale Bilingual Dataset Tailored for Long-Span Coreference Resolution

ACL ARR 2024 August Submission 374 Authors

16 Aug 2024 (modified: 07 Sept 2024), ACL ARR 2024 August Submission, CC BY 4.0

Abstract: Coreference resolution (CR) aims to match pronouns, noun phrases, and other mentions with their referent entities, and is an important step toward deep text understanding. Existing CR datasets are either small in scale or restrict coreference resolution to a limited text span. In this paper, we present NovelCR, a large-scale bilingual benchmark tailored for long-span coreference resolution. NovelCR not only contains extensive mention and coreference annotations (148k mentions and 128k coreferences in NovelCR-en, 311k mentions and 273k coreferences in NovelCR-zh), but also contains numerous long-span coreferences. Specifically, 74% of the coreferences in NovelCR-en and 83% of the coreferences in NovelCR-zh span three or more sentences, a significantly higher proportion of long-span coreferences than in existing datasets. Experiments on NovelCR reveal a large gap between state-of-the-art baselines and human performance, indicating that coreference resolution on NovelCR remains an open problem.
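As an illustration of the long-span statistic quoted in the abstract, the following minimal sketch computes the share of coreference links that span three or more sentences. It is not the authors' code. The function name `long_span_ratio`, the representation of a link as a pair of sentence indices, and the reading of "span" as the number of sentences from the referent's sentence to the mention's sentence (inclusive) are all assumptions made for this example; the dataset's actual annotation format and span definition may differ.

```python
from typing import Iterable, Tuple


def long_span_ratio(links: Iterable[Tuple[int, int]], min_sentences: int = 3) -> float:
    """Return the fraction of coreference links covering >= min_sentences sentences.

    Each link is a hypothetical (mention_sentence_idx, referent_sentence_idx) pair.
    Assumption: a link's span is counted inclusively, so a mention and referent
    in the same sentence span 1 sentence, adjacent sentences span 2, and so on.
    """
    links = list(links)
    if not links:
        return 0.0  # no links: define the ratio as 0 to avoid division by zero
    long_links = sum(1 for m, r in links if abs(m - r) + 1 >= min_sentences)
    return long_links / len(links)


# Toy data only, not drawn from NovelCR.
toy_links = [(0, 0), (1, 0), (4, 1), (7, 2)]
ratio = long_span_ratio(toy_links)
```

Under this inclusive definition, two of the four toy links count as long-span. Changing `min_sentences` or the span convention would shift the ratio, so any comparison against the reported 74% and 83% figures would first need the paper's exact definition.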

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: corpus creation; benchmarking; multilingual corpora; NLP datasets

Contribution Types: Data resources

Languages Studied: English; Chinese

Submission Number: 374
