Abstract: Bilingual word representations (BWRs) play a key role in many natural language processing (NLP) tasks, especially cross-lingual applications such as machine translation and cross-lingual information retrieval. Most existing methods learn BWRs offline and without supervision. These offline methods rely mainly on the isomorphism assumption, i.e., that word representations follow similar distributions across different languages. Several authors question this assumption and argue that the word representation spaces of many language pairs are non-isomorphic. In this paper, we propose a novel unsupervised method that trains BWRs jointly. We first use a dynamic programming algorithm to detect continuous bilingual segments. We then use the extracted bilingual data together with monolingual corpora to train BWRs jointly. Experiments on a real-world dataset show that our approach outperforms several baselines. (By unsupervised, we mean that no cross-lingual resources such as parallel text or bilingual lexicons are used directly.)
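The abstract does not specify the dynamic program used to detect continuous bilingual segments, so the following is only a hypothetical illustration of the general idea: given per-position cross-lingual similarity scores for a candidate sentence pair, a dynamic program (here, Kadane's maximum-subarray recurrence) can mark the contiguous span whose similarity most exceeds a threshold. The scoring scheme and the `threshold` parameter are assumptions for illustration, not the authors' method.

```python
# Hypothetical sketch (not the paper's algorithm): detect one continuous
# high-similarity segment via Kadane's dynamic program over the gains
# scores[i] - threshold.

def best_segment(scores, threshold=0.5):
    """Return (start, end) of the contiguous span maximizing
    sum(scores[i] - threshold), a toy stand-in for a detected
    bilingual segment."""
    best_sum, best_span = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        gain = s - threshold  # positive where the pair looks parallel
        if cur_sum <= 0:
            # restarting at i beats extending a non-positive prefix
            cur_sum, cur_start = gain, i
        else:
            cur_sum += gain
        if cur_sum > best_sum:
            best_sum, best_span = cur_sum, (cur_start, i + 1)
    return best_span

# Positions 2..4 score well, so they form the detected segment.
print(best_segment([0.1, 0.2, 0.9, 0.8, 0.95, 0.3]))  # (2, 5)
```

In a full pipeline, segments detected this way would be paired with their aligned target-side spans and added to the joint training data alongside the monolingual corpora.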