Diagnosing Language Inconsistency in Cross-Lingual Word Embeddings

Yoshinari Fujinuma, Jordan Boyd-Graber, Michael J. Paul

Sep 27, 2018 (modified: Nov 16, 2018). ICLR 2019 Conference Withdrawn Submission.
  • Abstract: Cross-lingual embeddings encode the meaning of words from different languages in a shared low-dimensional space. Despite their numerous applications, evaluation of such embeddings remains limited. We focus on diagnosing one problem: words that are segregated by language in cross-lingual embeddings. In an ideal cross-lingual embedding, word similarity should be independent of language; that is, words within a language should not be more similar to each other than to words in another language. One test of this is modularity, a network measure of the strength of clusters in a graph. When we apply this measure to a nearest-neighbor graph, imperfect cross-lingual embeddings are sorted into modular, distinct regions. The correlation of this measure with accuracy on two downstream tasks demonstrates that modularity can serve as an intrinsic metric of embedding quality.
  • Keywords: cross-lingual embeddings, evaluation, graph-based metric, modularity
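
The measurement described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes cosine similarity, a symmetrized k-nearest-neighbor graph, and the standard (unnormalized) Newman modularity with each word's language as its community label. The function names `knn_adjacency` and `language_modularity` are hypothetical.

```python
import numpy as np


def knn_adjacency(vectors, k):
    """Build a symmetric, unweighted k-nearest-neighbor graph (assumed: cosine similarity)."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a word is never its own neighbor
    neighbors = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar words
    n = len(vectors)
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbors.ravel()] = 1.0
    # Keep an edge if either endpoint lists the other as a neighbor.
    return np.maximum(adj, adj.T)


def language_modularity(adj, languages):
    """Newman modularity of the graph, treating each language as one community."""
    languages = np.asarray(languages)
    degrees = adj.sum(axis=1)
    two_m = degrees.sum()  # twice the number of edges
    same_language = languages[:, None] == languages[None, :]
    expected = np.outer(degrees, degrees) / two_m  # edges expected under random wiring
    return float(((adj - expected) * same_language).sum() / two_m)
```

Under these assumptions, a high score means that words connect mostly to words of their own language (the segregation the abstract describes), while a score near or below zero means that nearest neighbors cross language boundaries freely.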