Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Synthetic data, Model Distillation, Multilingual language models, Routing, Instruction Fine-Tuning
TL;DR: We introduce "multilingual arbitrage", a method that optimizes synthetic data generation by routing samples through multiple models, significantly improving multilingual model performance.
Abstract: The use of synthetic data has been crucial in achieving recent state-of-the-art breakthroughs. However, relying solely on a single oracle teacher model for data generation can lead to issues such as model collapse and bias propagation. These problems are particularly pronounced in multilingual contexts, where no single teacher model performs optimally across all languages. In this study, we propose a solution through multilingual arbitrage, which exploits performance variations among multiple models for each language. By strategically routing samples through a diverse set of models, each possessing unique strengths in different languages, we address these challenges. Our extensive experiments with state-of-the-art models demonstrate that our arbitrage techniques significantly enhance performance compared to relying on a single teacher model. Multilingual arbitrage yields win rates of up to 80% over state-of-the-art proprietary and widely adopted open-weight models such as Gemma 2, Llama 3.1, and Mistral v0.3. These gains, averaged across all languages, were most substantial for the less-resourced languages in our pool.
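The core routing idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the teacher names, the per-language score table, and the `route` helper are all hypothetical assumptions standing in for whatever reward signal the authors use to pick the best teacher per language.

```python
from typing import Dict, List

# Hypothetical per-language teacher quality scores (e.g. held-out win rates).
# Both the teacher names and the numbers here are illustrative assumptions.
TEACHER_SCORES: Dict[str, Dict[str, float]] = {
    "teacher_a": {"en": 0.90, "fr": 0.70, "sw": 0.40},
    "teacher_b": {"en": 0.60, "fr": 0.80, "sw": 0.75},
}

def route(language: str,
          scores: Dict[str, Dict[str, float]] = TEACHER_SCORES) -> str:
    """Pick the teacher model with the highest score for this language."""
    return max(scores, key=lambda teacher: scores[teacher].get(language, 0.0))

def assign_teachers(prompts_by_lang: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each prompt to its routed teacher (generation itself omitted)."""
    return {prompt: route(lang)
            for lang, prompts in prompts_by_lang.items()
            for prompt in prompts}
```

Under these assumed scores, English prompts would be routed to `teacher_a` while Swahili prompts go to `teacher_b`, so each language's synthetic data comes from the strongest available teacher rather than a single oracle.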
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9793