Track: Scientific Track
Keywords: Language Identification, Romansh, Low-Resource Languages, Dialect Identification
TL;DR: Language identification for the six Romansh varieties using character n-gram features, with a newly curated multi-domain benchmark.
Abstract: The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. This linguistic diversity calls for a language identification (LID) system that can distinguish between these idioms, yet to date there has been no well-documented effort to build one. Romansh LID should also recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, which makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on a support vector machine (SVM). We evaluate our model on a newly curated benchmark spanning two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.
Submission Number: 21