DANSK and DaCy 2.6.0: Domain generalization of Danish named entity recognition

Anonymous

16 Aug 2023, ACL ARR 2023 August Blind Submission, Readers: Everyone
Abstract: Named entity recognition is one of the cornerstones of Danish NLP, useful for providing insights within both industry and research. However, the field is inhibited by a lack of available datasets. As a consequence, no models are capable of fine-grained named entity recognition, nor have they been evaluated for potential generalizability issues across datasets and domains. To alleviate these limitations, this paper introduces: 1) DANSK, a named entity dataset that provides high-granularity tagging as well as within-domain evaluation of models across a diverse set of domains; 2) DaCy 2.6.0, which includes three generalizable models with fine-grained annotation; and 3) an evaluation of current state-of-the-art models' ability to generalize across domains. The evaluation of existing and new models revealed notable performance discrepancies across domains, which should be addressed within the field. Shortcomings in the annotation quality of the dataset and their impact on model training and evaluation are also discussed. Despite these limitations, we advocate for the use of the new dataset DANSK alongside further work on generalizability within Danish NER.
Paper Type: long
Research Area: Resources and Evaluation
Languages Studied: Danish