Unsupervised Morphology-Based Vocabulary Expansion

2014 (modified: 16 Jul 2019) | ACL (1) 2014 | Readers: Everyone
Abstract: We present a novel way of generating unseen words, which is useful for applications such as automatic speech recognition or optical character recognition in low-resource languages. We test our vocabulary generator on seven low-resource languages by measuring the decrease in out-of-vocabulary word rate on a held-out test set. The languages we study have very different morphological properties; we show how our results differ depending on the morphological complexity of the language. In our best result (on Assamese), our approach predicts 29% of the out-of-vocabulary word tokens using only a small amount of unlabeled training data.
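
The evaluation described in the abstract (decrease in token-based out-of-vocabulary rate on a held-out test set) can be illustrated with a minimal sketch. This is not the authors' code: the function names `token_oov_rate` and `oov_tokens_recovered` are hypothetical, and the toy word lists are invented for illustration only, not data from the paper.

```python
def token_oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that are missing from the vocabulary."""
    if not test_tokens:
        return 0.0
    misses = sum(1 for tok in test_tokens if tok not in vocabulary)
    return misses / len(test_tokens)


def oov_tokens_recovered(test_tokens, base_vocab, generated_words):
    """Share of the OOV test tokens (relative to base_vocab) that the
    generated word list covers."""
    oov = [tok for tok in test_tokens if tok not in base_vocab]
    if not oov:
        return 0.0
    hits = sum(1 for tok in oov if tok in generated_words)
    return hits / len(oov)


# Toy data (assumed, for illustration only).
train_vocab = {"walk", "walks", "talk", "talked"}
generated = {"walked", "talks", "walking"}  # hypothetical generator output
test = ["walk", "walked", "talks", "jumped", "walked", "talk"]

base_rate = token_oov_rate(test, train_vocab)
expanded_rate = token_oov_rate(test, train_vocab | generated)
recovered = oov_tokens_recovered(test, train_vocab, generated)

print(f"OOV rate before expansion: {base_rate:.3f}")
print(f"OOV rate after expansion:  {expanded_rate:.3f}")
print(f"OOV tokens recovered:      {recovered:.3f}")
```

Under this reading, the 29% figure reported for Assamese would correspond to the quantity computed by `oov_tokens_recovered`, counted over tokens rather than distinct word types.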