The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments

ACL ARR 2024 June Submission3213 Authors

15 Jun 2024 (modified: 10 Jul 2024) · License: CC BY 4.0
Abstract: Multilinguality is crucial for extending recent advancements in language modelling to diverse linguistic communities. To maintain high performance while representing multiple languages, multilingual models ideally align representations, allowing what is learned in one language to generalise to others. Prior research has emphasised the importance of parallel data and shared vocabulary elements as key factors for such alignment. In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance. In controlled experiments on perfectly equivalent cloned languages, we observe that the existence of a predominant language during training boosts the performance of less frequent languages and leads to stronger alignment of model representations across languages. Furthermore, we find that this trend is amplified with scale: with large enough models or long enough training, we observe that bilingual training data with a 90/10 language split yields strictly better performance on both languages than a balanced 50/50 split. Building on these insights, we design training schemes that can improve performance in all cloned languages, even without altering the training data. As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance drives cross-lingual generalisation in this setting remains inconclusive.
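
Illustration (not part of the submission): a minimal sketch of how a cloned-language setup with an imbalanced bilingual mixture could be constructed, assuming cloning is done by shifting token IDs into a disjoint vocabulary range and that the 90/10 split is realised by sampling; the function names and offset scheme are hypothetical and are not the authors' implementation.

    # Sketch: clone a language by remapping token IDs, then mix with a 90/10 split.
    import random
    from typing import List

    VOCAB_SIZE = 32_000  # assumed tokenizer vocabulary size

    def clone_sequence(token_ids: List[int], vocab_size: int = VOCAB_SIZE) -> List[int]:
        """Map every token ID into a disjoint ID range, yielding a language that is
        statistically identical to the original but shares no vocabulary with it."""
        return [t + vocab_size for t in token_ids]

    def sample_bilingual_batch(corpus: List[List[int]],
                               p_main: float = 0.9,
                               batch_size: int = 8) -> List[List[int]]:
        """Draw a batch where each sequence comes from the 'main' language with
        probability p_main and from its clone otherwise (e.g. a 90/10 split)."""
        batch = []
        for _ in range(batch_size):
            seq = random.choice(corpus)
            batch.append(seq if random.random() < p_main else clone_sequence(seq))
        return batch

    # Toy usage: a corpus of tokenised sequences mixed at a 90/10 ratio.
    toy_corpus = [[1, 5, 7, 2], [3, 3, 9, 2], [4, 8, 6, 2]]
    print(sample_bilingual_batch(toy_corpus, p_main=0.9, batch_size=4))

Under these assumptions, setting p_main = 0.5 would reproduce the balanced 50/50 condition the abstract compares against.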
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, multilingual representations, multilingual pre-training
Contribution Types: Model analysis & interpretability
Languages Studied: English, French
Submission Number: 3213