Abstract: We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We examine zero-shot performance under balanced data conditions to mitigate data-size confounds, classifying pretraining languages that improve downstream performance as donors, and languages whose zero-shot performance improves most as recipients. We develop a method with quadratic time complexity in the number of pretraining languages to estimate these inter-language relations, in place of an exhaustive, exponential computation over all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of future large-scale multilingual language models in choosing better pretraining configurations.
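The abstract does not spell out the estimation procedure, so the sketch below is only an illustration of the complexity claim, not the authors' method: it contrasts enumerating candidate donor–recipient language pairs (quadratic in the number of pretraining languages) with enumerating every possible pretraining subset (exponential). The language codes and function names are hypothetical.

```python
# Illustrative sketch (assumption): quadratic pairwise enumeration vs.
# exhaustive subset enumeration of pretraining languages.
from itertools import combinations

def pairwise_configurations(languages):
    """All ordered (donor candidate, recipient candidate) pairs: O(n^2)."""
    return [(a, b) for a in languages for b in languages if a != b]

def exhaustive_configurations(languages):
    """All non-empty pretraining subsets: O(2^n), quickly infeasible."""
    subsets = []
    for k in range(1, len(languages) + 1):
        subsets.extend(combinations(languages, k))
    return subsets

langs = ["en", "es", "ru", "ar", "hi", "zh"]  # hypothetical language set
print(len(pairwise_configurations(langs)))    # 30 pairs for n = 6
print(len(exhaustive_configurations(langs)))  # 63 subsets; 2^n - 1 grows fast
```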
Paper Type: long