CW3NE: A Genre-oriented Corpus for Nested Named Entity Recognition in Chinese Web Novels

ACL ARR 2024 June Submission 596 Authors

12 Jun 2024 (modified: 24 Aug 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Named entities are important for understanding literary works, which emphasize characters, plots, and environments. Research on named entity recognition (NER) in the literary domain, especially nested NER, remains insufficient, partly due to a lack of annotated data. To address this issue, we construct the first Genre-oriented Corpus for $\textbf{N}$ested $\textbf{N}$amed $\textbf{E}$ntity Recognition in $\textbf{C}$hinese $\textbf{W}$eb $\textbf{N}$ovels, namely $\textbf{CW3NE}$, comprising 400 chapters totaling 1,214,283 tokens in two genres, XuanHuan (Eastern Fantasy) and History. Based on the corpus, we analyze in depth the distribution of different entity types, including person, location, and organization. We also compare the nesting patterns of nested entities in CW3NE with those in the English corpus LitBank. Even though both genres belong to the literary domain, their entities overlap little, making genre adaptation of NER a hard problem. We provide several baseline NER methods; experimental results show that methods based on large language models perform worse than a well-designed method based on a small language model. Performance drops sharply on nested NER for all baseline methods, indicating the great challenge posed by nested named entities. Genre adaptation also causes a large performance drop, especially on location and organization entities. We will release our corpus to promote research on literary NER.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Nested Named Entity Recognition, Genre-oriented, Chinese Web Novels
Contribution Types: Reproduction study, Data resources, Data analysis
Languages Studied: Chinese, English
Submission Number: 596