Vorm: Translations and a constrained hypothesis space support unsupervised morphological segmentation across languages
Keywords: unsupervised morphological segmentation, low-resource languages, morphological typology, reduplication, canonical segmentation
TL;DR: A new unsupervised morphological segmentation system that leverages translation data performs competitively on canonical segmentation
Abstract: This paper introduces Vorm, an unsupervised morphological segmentation system that leverages translation data to infer highly accurate morphological transformations, including less frequently modeled processes such as infixation and reduplication. The system is evaluated on standard benchmark data and on a novel, typologically diverse dataset of 37 languages. Model performance is competitive, and sometimes superior, on canonical segmentation but more limited on surface segmentation.
Copyright Agreement: pdf
Submission Number: 215