Abstract: We present a method for incorporating BERT embeddings into neural morpheme segmentation. We show that our method improves significantly over the baseline on 6 typologically diverse languages (English, Finnish, Turkish, Estonian, Georgian and Zulu). Moreover, it establishes a new SOTA on the 4 languages for which language-specific models are available. We demonstrate that the performance gains come not only from BERT's BPE vocabulary but also from the embeddings themselves. Additionally, we show that a simpler pretraining task that optimizes a subword word2vec-like objective also reaches state-of-the-art performance on 4 of the 6 languages considered.
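As a rough illustration of the kind of setup the abstract describes (not the authors' actual implementation), the sketch below feeds frozen BERT subword embeddings as extra features into a character-level segmentation tagger. The model name, the tag inventory, and the character-to-subword alignment scheme are all assumptions made for the example.

```python
# A minimal sketch, assuming a feature-based setup where each character's input
# is the concatenation of a character embedding and the embedding of the BERT
# subword (BPE piece) that contains it. All names here are illustrative.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

bert_name = "bert-base-multilingual-cased"  # assumed; the paper may use other BERT models
tokenizer = AutoTokenizer.from_pretrained(bert_name)
bert = AutoModel.from_pretrained(bert_name).eval()  # frozen feature extractor


class CharSegmenter(nn.Module):
    """BiLSTM tagger over characters, each character augmented with the
    BERT embedding of the subword that covers it."""

    def __init__(self, n_chars, n_tags, char_dim=64, bert_dim=768, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim + bert_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)  # e.g. BMES-style boundary tags

    def forward(self, char_ids, char_bert_feats):
        x = torch.cat([self.char_emb(char_ids), char_bert_feats], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)


def bert_features_per_char(word):
    """Broadcast each subword's BERT embedding to the characters it covers
    (a simple alignment heuristic assumed for this sketch)."""
    enc = tokenizer(word, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        sub_vecs = bert(**enc).last_hidden_state[0]        # (n_subwords, 768)
    pieces = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    feats = []
    for piece, vec in zip(pieces, sub_vecs):
        n = len(piece.replace("##", ""))                   # characters covered by this piece
        feats.extend([vec] * n)
    return torch.stack(feats)                              # (n_chars, 768)
```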