DanTok: Domain Beats Language for Danish Social Media POS Tagging

Published: 20 Mar 2023, Last Modified: 11 Apr 2023, NoDaLiDa 2023
Keywords: pos tagging, social media, tiktok, domain adaptation, cross-lingual learning, Danish
TL;DR: The best strategy for obtaining a high-quality tagger is using domain-specific models when available, even if multilingual.
Abstract: Language from social media remains challenging to process automatically, especially for non-English languages. In this work, we introduce the first NLP dataset for TikTok comments and the first Danish social media dataset with part-of-speech annotation. We further supply annotations for normalization, code-switching, and annotator uncertainty. As transferring models to such a highly specialized domain is non-trivial, we conduct an extensive study into which source data and modeling decisions most impact performance. Surprisingly, transferring from in-domain data, even from a different language, outperforms in-language, out-of-domain training. These benefits nonetheless rely on the underlying language models having been at least partially pre-trained on data from the target language. Using our additional annotation layers, we further analyze how normalization, code-switching, and human uncertainty affect the tagging accuracy.
Student Paper: Yes, the first author is a student