Abstract: Danish natural language processing (NLP) has improved considerably in recent years
with the addition of multiple new datasets and models. However, there is currently no coherent
framework for applying state-of-the-art models to Danish. We present DaCy: a unified framework
for Danish NLP built on spaCy. DaCy uses efficient multitask models that obtain state-of-the-art
performance on named entity recognition, part-of-speech tagging, and dependency parsing. DaCy
contains tools for easy integration of existing models, such as those for polarity, emotion, or subjectivity
detection. In addition, we conduct a series of tests for bias and robustness in Danish NLP pipelines
by augmenting the test set of DaNE. DaCy large compares favorably and is especially
robust to long inputs, spelling variations, and spelling errors. All models except DaCy large display
significant biases related to ethnicity, while only Polyglot shows a significant gender bias. We argue
that for languages with limited benchmark sets, data augmentation can be particularly useful for
obtaining more realistic and fine-grained performance estimates. We provide a series of augmenters
as a first step towards a more thorough evaluation of language models for low- and medium-resource
languages and encourage further development.
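
To illustrate the kind of test-set augmentation described above, the following is a minimal sketch in plain Python, not the DaCy implementation. It assumes a tokenized sentence with BIO entity tags; the function names `keystroke_augment` and `name_augment`, the partial keyboard map, and the name lists are hypothetical and chosen only for illustration. Both functions keep the number of tokens unchanged, so the gold labels stay aligned and a model can be scored on the original and the augmented test set with the same metric.

```python
import random

# Hypothetical, deliberately partial map of neighbouring keys.
# A realistic augmenter would cover the full keyboard layout.
KEYBOARD_NEIGHBOURS = {
    "a": "qsz", "e": "wrd", "i": "uok", "o": "ipl",
    "n": "bmh", "r": "etf", "s": "adw", "t": "ryg",
}


def keystroke_augment(tokens, rate, rng):
    """Simulate spelling errors: replace each mapped character with a
    neighbouring key with probability `rate`. The token count is
    preserved, so token-level gold labels remain valid."""
    augmented = []
    for token in tokens:
        chars = []
        for ch in token:
            neighbours = KEYBOARD_NEIGHBOURS.get(ch.lower())
            if neighbours and rng.random() < rate:
                new_ch = rng.choice(neighbours)
                # Keep the original casing of the replaced character.
                chars.append(new_ch.upper() if ch.isupper() else new_ch)
            else:
                chars.append(ch)
        augmented.append("".join(chars))
    return augmented


def name_augment(tokens, tags, first_names, last_names, rng):
    """Swap person names for sampled ones to probe name-related bias.
    Simplifying assumption: a B-PER token becomes a sampled first name
    and each I-PER token a sampled last name, so alignment is kept."""
    augmented = []
    for token, tag in zip(tokens, tags):
        if tag == "B-PER":
            augmented.append(rng.choice(first_names))
        elif tag == "I-PER":
            augmented.append(rng.choice(last_names))
        else:
            augmented.append(token)
    return augmented


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the augmented test set is reproducible
    tokens = ["Peter", "Schmeichel", "bor", "i", "København"]
    tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
    print(keystroke_augment(tokens, rate=0.1, rng=rng))
    print(name_augment(tokens, tags, ["Mohammed"], ["Hassan"], rng))
```

Comparing a model's score on the augmented sentences against its score on the originals gives a per-augmentation estimate of robustness or bias, which is the general idea the abstract argues for.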