RUMLEM: A Dictionary-Based Lemmatizer for Romansh

Dominic P. Fischer; Zachary Hopton; Jannis Vamvas

17 Mar 2026 (modified: 19 May 2026) | SwissText 2026 Conference Submission | CC BY 4.0

Track: Scientific Track

Keywords: Romansh, Lemmatization, NLP, Low-Resource Languages, Minority Languages, Dictionary-Based

TL;DR: A dictionary-based Romansh lemmatizer that covers about 80% of the words in typical Romansh texts. It can also be used for Romansh variety classification, where it correctly identifies the variety in 95% of cases, and for Romansh vs. non-Romansh text classification.

Abstract: Lemmatization – the task of mapping an inflected word form to its dictionary form – is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77–84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
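
To make the idea concrete, the following is a minimal sketch of dictionary-based lemmatization and coverage-based variety classification. It is not RUMLEM's implementation: the function names, the tokenizer, and the highest-coverage decision rule are assumptions made for illustration, and the two tiny form-to-lemma tables are toy entries rather than excerpts from the actual morphological databases.

```python
import re

# Toy form -> lemmas tables, one per variety. These entries are illustrative
# placeholders (assumption), not data taken from RUMLEM's databases.
TOY_DATABASES = {
    "sursilvan": {"casa": {"casa"}, "casas": {"casa"}},
    "rumantsch_grischun": {"chasa": {"chasa"}, "chasas": {"chasa"}},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (simplified)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lemmatize(tokens, database):
    """Look up each token; unknown tokens get an empty list of lemmas."""
    return [(tok, sorted(database.get(tok, ()))) for tok in tokens]


def coverage(tokens, database):
    """Fraction of tokens that have at least one entry in the database."""
    if not tokens:
        return 0.0
    return sum(tok in database for tok in tokens) / len(tokens)


def classify_variety(text, databases):
    """Pick the variety whose database covers the largest share of tokens.

    Returns (None, scores) when no database covers any token. Ties are
    resolved in favour of the variety listed first.
    """
    tokens = tokenize(text)
    scores = {name: coverage(tokens, db) for name, db in databases.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None, scores
    return best, scores


if __name__ == "__main__":
    print(lemmatize(tokenize("Las casas"), TOY_DATABASES["sursilvan"]))
    print(classify_variety("Las casas", TOY_DATABASES))
```

Under the same assumptions, a Romansh vs. non-Romansh decision could be sketched by thresholding the best coverage score, since a text in another language would be expected to match few entries in any of the databases.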

Submission Number: 33
