Abstract: Among the algorithms for automated morpheme segmentation of Russian, the best-performing ones are based on machine learning, and their quality is currently close to expert level. However, most research focuses on the analysis of lemmata rather than word forms. In this study, we compare state-of-the-art methods for morpheme segmentation on a prepared dataset of word forms. We evaluate three approaches: an ensemble of convolutional neural networks, the subword Transformer model DeepSPIN-3, and RuRoberta-based Morphberta models. To assess the robustness of these models, we use multiple strategies for splitting the dataset into training and test sets, specifically to examine how performance degrades on out-of-vocabulary lemmata and roots. The best results were achieved by the Morphberta models (over 99.5% completely accurate segmentations). However, our findings also demonstrate that random dataset splitting does not give a complete picture of an algorithm's quality. Specifically, when dealing with out-of-vocabulary morphemes, segmentation accuracy declines significantly, with the extent and nature of the decline varying across algorithms.
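The difference between a random split and a split that keeps lemmata or roots out of the training vocabulary can be illustrated with a minimal Python sketch. It is not the authors' code: the helper names `split_by_group` and `word_accuracy`, the record fields, and the toy word forms with their segmentations are all assumptions made for illustration. Grouping by `"lemma"` or `"root"` guarantees that no test word form shares that key with any training word form, and `word_accuracy` counts a word as correct only if its whole segmentation matches.

```python
import random
from collections import defaultdict

# Toy records (illustrative only, not taken from the paper's dataset).
SAMPLES = [
    {"form": "домик", "lemma": "домик", "root": "дом", "seg": "дом/ик"},
    {"form": "домика", "lemma": "домик", "root": "дом", "seg": "дом/ик/а"},
    {"form": "лесник", "lemma": "лесник", "root": "лес", "seg": "лес/ник"},
    {"form": "лесника", "lemma": "лесник", "root": "лес", "seg": "лес/ник/а"},
    {"form": "водный", "lemma": "водный", "root": "вод", "seg": "вод/н/ый"},
    {"form": "водного", "lemma": "водный", "root": "вод", "seg": "вод/н/ого"},
]


def split_by_group(samples, key, test_fraction=0.2, seed=0):
    """Split so that every value of `key` lands entirely in train or in test."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    names = sorted(groups)                  # deterministic order before shuffling
    random.Random(seed).shuffle(names)
    n_test = max(1, round(len(names) * test_fraction))
    test = [s for name in names[:n_test] for s in groups[name]]
    train = [s for name in names[n_test:] for s in groups[name]]
    return train, test


def word_accuracy(gold, predicted):
    """Share of words whose predicted segmentation matches the gold one exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


train, test = split_by_group(SAMPLES, key="lemma", test_fraction=0.34)
```

Passing `key="root"` instead of `key="lemma"` gives the stricter setting in which the test roots are also unseen during training.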
External IDs: doi:10.1007/978-3-032-04958-2_12