Murreviikko - A Dialectologically Annotated and Normalized Dataset of Finnish Tweets

07 Aug 2023 | OpenReview Archive Direct Upload | Readers: Everyone
Abstract: This paper presents Murreviikko, a dataset of dialectal Finnish tweets that have been dialectologically annotated and manually normalized to a standard form. The dataset can be used, for instance, as a test set for dialect identification and dialect-to-standard normalization. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus with three newly trained models of different architectures. We find that there are significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.