Russian Neural Morpheme Segmentation: From Lemmata to Wordforms

Dmitry Morozov, Olga Shcherbakova, Anna Glazkova

Published: 01 Jan 2026, Last Modified: 05 Jan 2026. Source: Crossref. License: CC BY-SA 4.0
Abstract: Among the algorithms for automated morpheme segmentation of Russian, the best-performing ones are based on machine learning, and their quality is currently close to expert level. However, most research focuses on the analysis of lemmata rather than word forms. In this study, we compare state-of-the-art methods for morpheme segmentation on a prepared dataset of word forms. We evaluate three approaches: an ensemble of convolutional neural networks, the subword Transformer model DeepSPIN-3, and RuRoberta-based Morphberta models. To assess the robustness of these models, we use multiple strategies for splitting the dataset into training and test sets, specifically to examine how performance degrades on out-of-vocabulary lemmata and roots. The best results are achieved by the Morphberta models (over 99.5% completely accurate segmentations). However, our findings also show that random dataset splitting does not give a complete picture of an algorithm's quality. Specifically, when dealing with out-of-vocabulary morphemes, segmentation accuracy declines significantly, with the extent and nature of the decline varying across algorithms.
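The splitting strategies and the "completely accurate segmentations" metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the record fields (`form`, `lemma`, `root`, `seg`), the helper names `group_split` and `exact_match_accuracy`, the `/` boundary notation, and the toy word list with simplified segmentations are all assumptions made for illustration. The idea is that a lemma-disjoint or root-disjoint split keeps every word form of a held-out lemma (or root) out of the training set, so test items are out-of-vocabulary at that level.

```python
import random
from collections import defaultdict

# Toy, hand-made records (hypothetical; segmentations are simplified).
RECORDS = [
    {"form": "домами", "lemma": "дом",    "root": "дом", "seg": "дом/ами"},
    {"form": "домик",  "lemma": "домик",  "root": "дом", "seg": "дом/ик"},
    {"form": "лесов",  "lemma": "лес",    "root": "лес", "seg": "лес/ов"},
    {"form": "лесник", "lemma": "лесник", "root": "лес", "seg": "лес/ник"},
    {"form": "водой",  "lemma": "вода",   "root": "вод", "seg": "вод/ой"},
    {"form": "водный", "lemma": "водный", "root": "вод", "seg": "вод/н/ый"},
]


def group_split(records, key, test_fraction=0.2, seed=0):
    """Split records so that no value of `key` appears in both train and test."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    keys = sorted(groups)                  # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


def exact_match_accuracy(gold, predicted):
    """Share of words whose predicted segmentation matches the gold one exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Lemma-disjoint split: test lemmata are unseen, but their roots may be seen.
lemma_train, lemma_test = group_split(RECORDS, "lemma", test_fraction=0.34)
# Root-disjoint split: every word form sharing a held-out root goes to test.
root_train, root_test = group_split(RECORDS, "root", test_fraction=0.34)
```

Under these assumptions, a random split would correspond to shuffling the records themselves, while grouping by `lemma` or `root` gives progressively harder test sets. A word counts as correct in `exact_match_accuracy` only if every boundary is right, which is one plausible reading of "completely accurate segmentations".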