Harvest Uyghur-Chinese Aligned-Sentences Bitexts from Multilingual Sites Based on Word Embedding

2017 (modified: 15 Nov 2021), CCL 2017, Readers: Everyone
Abstract: Obtaining bilingual parallel data from multilingual websites is a long-standing research problem, and solving it is highly beneficial for resource-scarce languages. In this paper, we present an approach for obtaining parallel data based on word embeddings; our model relies only on a small bilingual lexicon. Our approach benefits from recent advances in continuous word representations, which capture more contextual information than traditional methods. Our experiments show that sizable, high-precision parallel Uyghur-Chinese data can be obtained even when bilingual lexicon resources are lacking.