Track: Scientific Track
Keywords: Romansh, Lemmatization, NLP, Low-Resource Languages, Minority Languages, Dictionary-Based
TL;DR: A dictionary-based Romansh lemmatizer that covers roughly 80% of the words in a typical Romansh text. It also supports Romansh variety classification, identifying the correct variety in 95% of cases, and Romansh vs. non-Romansh text classification.
Abstract: Lemmatization – the task of mapping an inflected word form to its dictionary form – is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77–84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30,000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
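The idea described in the abstract can be illustrated with a minimal sketch. This is not RUMLEM's actual implementation. It assumes that each variety's database can be reduced to a mapping from inflected form to lemma, and that the variety is chosen by comparing token coverage across databases. The abstract states only that dedicated per-variety databases enable classification, so the coverage rule is an assumption. The database contents, variety names, and function names below are hypothetical placeholders, not real Romansh data.

```python
import re

# Hypothetical toy databases: variety name -> {inflected form -> lemma}.
# Placeholder strings only; these are NOT real Romansh entries.
TOY_DATABASES = {
    "variety_a": {"form1": "lemma1", "form2": "lemma1", "shared": "lemma_s"},
    "variety_b": {"form3": "lemma3", "shared": "lemma_s"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def lemmatize(tokens, database):
    """Look up each token; tokens missing from the database map to None."""
    return [database.get(token) for token in tokens]


def coverage(tokens, database):
    """Fraction of tokens that have an entry in the database."""
    if not tokens:
        return 0.0
    return sum(token in database for token in tokens) / len(tokens)


def classify_variety(text, databases):
    """Pick the variety whose database covers the most tokens (assumed rule)."""
    tokens = tokenize(text)
    scores = {name: coverage(tokens, db) for name, db in databases.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = classify_variety("form1 shared form2 unknown", TOY_DATABASES)
print(best, scores)
```

Under the same assumption, a Romansh vs. non-Romansh decision could be sketched by thresholding the best coverage score. The threshold value would have to be chosen empirically and is not given in the abstract.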
Submission Number: 33