Keywords: African Languages, Low-Resource Languages, Natural Language Processing, Multilingual Models, Text Classification, Empirical Study, Named Entity Recognition
TL;DR: A comprehensive study of the effects of a shared vocabulary space, cross-script pretraining, and high-resource transfer on the cross-lingual abilities of multilingual models in zero- and few-shot settings.
Abstract: Multilingual pretrained language models have been shown to work well on many languages, even those they were not originally pretrained on. Despite their empirical success in downstream tasks, there is still a gap in understanding "what makes them tick". In this paper, we try to understand the effects of sharing a vocabulary space on the cross-lingual abilities of a multilingual model. We train multiple monolingual and multilingual models and compare their effectiveness on downstream tasks. In monolingual models, a single language occupies the entire vocabulary space, limiting possible cross-lingual transfer. In a multilingual setting, by contrast, the model benefits from cross-lingual transfer, with the trade-off of having to split the vocabulary space between multiple languages. We present a comprehensive study of the effects of a shared vocabulary space, cross-script pretraining, and high-resource transfer on the cross-lingual abilities of multilingual models in zero- and few-shot settings. From our study, we observe that scaling the number of languages is beneficial for cross-lingual transfer in low-resource multilingual models up to a point, after which transfer effects saturate. We find that there is not much benefit from pretraining low-resource multilingual models with a high-resource language, and that cross-lingual transfer is possible even when the languages are written in different scripts. This empirical study is conducted in the context of three linguistically different low-resource African languages (Amharic, Hausa, and Swahili), with evaluation performed on two different tasks: text classification and named entity recognition. During the course of our experiments, we also performed an audit of the quality of two common low-resource language corpora (Common Crawl and BBC News data).