Identifying Token-Level Dialectal Features in Social Media

Published: 20 Mar 2023, Last Modified: 18 Apr 2023, Venue: NoDaLiDa 2023, Readers: Everyone
Keywords: dialectal features, token-level dialectal features, dialect identification
TL;DR: We introduce the task of token-level dialectal feature prediction. We provide annotation guidelines for Norwegian dialects and a manually annotated corpus. We also evaluate the learnability of the task by conducting various labeling experiments.
Abstract: Dialectal variation is present in many human languages and is attracting growing interest in NLP. Most previous work has concentrated on either (1) classifying dialectal varieties at the document or sentence level or (2) performing standard NLP tasks on dialectal data. In this paper, we propose the novel task of token-level dialectal feature prediction. We present a set of fine-grained annotation guidelines for Norwegian dialects, expand a corpus of dialectal tweets, and manually annotate them using these guidelines. Furthermore, to evaluate the learnability of the task, we conduct labeling experiments using a collection of baselines together with weakly supervised and supervised sequence labeling models. The obtained results show that, despite the difficulty of the task and the scarcity of training data, many dialectal features can be predicted with reasonably high accuracy.