Evaluating Word Embeddings on Low-Resource Languages

Nathan Stringham, Mike Izbicki

Published: 01 Jan 2020, Last Modified: 30 Jul 2025, Venue: Eval4NLP 2020, License: CC BY-SA 4.0

Abstract: The analogy task introduced by Mikolov et al. (2013) has become the standard metric for tuning the hyperparameters of word embedding models. In this paper, however, we argue that the analogy task is unsuitable for low-resource languages for two reasons: (1) it requires that word embeddings be trained on large amounts of text, and (2) analogies may not be well-defined in some low-resource settings. We solve these problems by introducing the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting. We use these metrics to successfully tune hyperparameters for a low-resource emoji embedding task and for word embeddings on 16 extinct languages. The largest of these languages (Ancient Hebrew) has a dataset of 41 million tokens, and the smallest (Old Gujarati) has a dataset of only 1,813 tokens.

External IDs: dblp:conf/eval4nlp/StringhamI20