Abstract: Bilingual word representations (BWRs) play a key role in many natural language processing (NLP) tasks, especially cross-lingual applications such as machine translation and cross-lingual information retrieval. Most existing methods learn BWRs offline and without supervision. These offline methods rely mainly on the isomorphism assumption, i.e., that word representations follow similar distributions across different languages. Several authors question this assumption and argue that the word representation spaces of many language pairs are non-isomorphic. In this paper, we propose a novel unsupervised method that trains BWRs jointly. We first use a dynamic programming algorithm to detect continuous bilingual segments. We then use the extracted bilingual data together with monolingual corpora to train BWRs jointly. Experiments on a real-world dataset show that our approach outperforms several baselines. (By unsupervised, we mean that no cross-lingual resources such as parallel text or bilingual lexicons are used directly.)
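The abstract does not specify the dynamic program used to detect continuous bilingual segments, so the following is only a hypothetical illustration of the general idea: given per-position cross-lingual similarity scores for a candidate sentence pair, a dynamic program (here, Kadane's maximum-subarray recurrence) can mark the contiguous span whose similarity most exceeds a threshold. The scoring scheme and the `threshold` parameter are assumptions for illustration, not the authors' method.

```python
# Hypothetical sketch (not the paper's algorithm): detect one continuous
# high-similarity segment via Kadane's dynamic program over the gains
# scores[i] - threshold.

def best_segment(scores, threshold=0.5):
    """Return (start, end) of the contiguous span maximizing
    sum(scores[i] - threshold), a toy stand-in for a detected
    bilingual segment."""
    best_sum, best_span = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        gain = s - threshold  # positive where the pair looks parallel
        if cur_sum <= 0:
            # restarting at i beats extending a non-positive prefix
            cur_sum, cur_start = gain, i
        else:
            cur_sum += gain
        if cur_sum > best_sum:
            best_sum, best_span = cur_sum, (cur_start, i + 1)
    return best_span

# Positions 2..4 score well, so they form the detected segment.
print(best_segment([0.1, 0.2, 0.9, 0.8, 0.95, 0.3]))  # (2, 5)
```

In a full pipeline, segments detected this way would be paired with their aligned target-side spans and added to the joint training data alongside the monolingual corpora.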