Abstract: Bilingual Lexicon Induction (BLI) aims to induce word translations between two distinct languages. The bilingual dictionaries generated by BLI are essential for cross-lingual NLP applications. Most existing methods assume that a single mapping matrix can be learned to project the embedding of a word in the source language onto the embedding of its translation in the target language. However, due to the complicated nature of linguistic regularities, a single matrix may provide neither a sufficiently large parameter space nor the flexibility to tailor to the semantics of words across different domains and topics. In this paper, we propose a Soft Piecewise Mapping Model (SPMM). It generates word alignments between two languages by learning multiple mapping matrices under orthogonality constraints, where each matrix encodes translation knowledge over a distribution of latent topics in the embedding spaces. This learning problem can be formulated as an extended version of Wahba's problem, for which we derive a closed-form solution. To address the limited size of training data for low-resource languages and emerging domains, an iterative boosting method based on SPMM is used to augment training dictionaries. Experiments conducted on both general and domain-specific corpora show that SPMM is effective and outperforms previous methods.
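The abstract gives no equations, but the closed-form solution it refers to builds on the classical orthogonal Procrustes/Wahba solution via SVD, which single-matrix BLI methods use and which SPMM extends to multiple topic-conditioned matrices. Below is a minimal NumPy sketch of that single-matrix baseline and of a soft mixture-of-maps prediction in the spirit of SPMM; the function names, the mixture form in `spmm_predict`, and all variable shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def orthogonal_procrustes_map(X, Y):
    """Closed-form orthogonal W minimizing ||W X - Y||_F
    (the orthogonal Procrustes / Wahba problem).

    X: (d, n) source-language embeddings of n dictionary pairs
    Y: (d, n) target-language embeddings of the same pairs
    Returns W: (d, d) with W @ W.T = I.
    """
    # The minimizer is W = U V^T, where U S V^T = SVD(Y X^T).
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

def spmm_predict(x, Ws, gammas):
    """Hedged sketch of an SPMM-style prediction: a soft mixture of
    orthogonal maps weighted by latent-topic responsibilities.
    The paper's exact formulation may differ.

    x: (d,) source word vector
    Ws: list of K (d, d) orthogonal matrices, one per latent topic
    gammas: (K,) nonnegative topic weights summing to 1
    """
    return sum(g * (W @ x) for g, W in zip(gammas, Ws))

# Illustrative use on random stand-in embeddings.
rng = np.random.default_rng(0)
d, n = 300, 5000
X = rng.standard_normal((d, n))
Y = rng.standard_normal((d, n))
W = orthogonal_procrustes_map(X, Y)
assert np.allclose(W @ W.T, np.eye(d), atol=1e-8)  # W is orthogonal
```

Intuitively, the SVD step maximizes the trace alignment between the mapped source vectors and the target vectors; SPMM's contribution, per the abstract, is solving a soft, topic-weighted extension of this problem with one orthogonal matrix per latent topic.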