Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages

ACL ARR 2024 June Submission5047 Authors

16 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: LLMs have become a go-to solution not just for text generation, but also for natural language understanding (NLU) tasks. Having acquired extensive knowledge through language modeling on web-scale corpora, they excel on English NLU, yet struggle to extend their NLU capabilities to underrepresented languages. In contrast, machine translation (MT) models produce excellent multilingual representations, resulting in strong translation performance even for low-resource languages. MT encoders, however, lack the knowledge necessary for comprehensive NLU that LLMs obtain through language modeling training on immense corpora. In this work, we get the best of both worlds by integrating MT encoders directly into LLM backbones via sample-efficient self-distillation. The resulting MT-LLMs preserve the inherent multilingual representational alignment from the MT encoder, allowing lower-resource languages to tap into the rich knowledge embedded in English-centric LLMs. By merging the MT encoder and LLM into a single model, we mitigate the propagation of translation errors and the inference overhead of MT decoding inherent to discrete translation-based cross-lingual transfer (e.g., translate-test). Evaluation spanning three prominent NLU tasks and 127 predominantly low-resource languages shows that MT-LLMs are highly effective in cross-lingual transfer. MT-LLMs substantially and consistently outperform translate-test based on the same MT model, showing that we truly unlock multilingual language understanding for LLMs.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism,cross-lingual transfer,multilingual representations,less-resourced languages
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Achinese,Mesopotamian Arabic,Afrikaans,Tosk Albanian,Amharic,North Levantine Arabic,Standard Arabic,Najdi Arabic,Moroccan Arabic,Egyptian Arabic,Assamese,Aymara,North Azerbaijani,Bambara,Balinese,Batak Toba,Bengali,Banjar,Tibetan,Buginese,Bulgarian,Catalan,Cebuano,Czech,Central Kurdish,Danish,German,Modern Greek,English,Estonian,Basque,Finnish,French,Nigerian Fulfulde,West Central Oromo,Guarani,Gujarati,Haitian Creole,Hausa,Hebrew,Hindi,Croatian,Hungarian,Armenian,Igbo,Iloko,Indonesian,Icelandic,Italian,Javanese,Japanese,Kachin,Kannada,Georgian,Kazakh,Kabuverdianu,Halh Mongolian,Central Khmer,Kinyarwanda,Kirghiz,Korean,Southern Kirghiz,Lao,Lingala,Lithuanian,Ganda,Luo,Latvian,Madurese,Malayalam,Marathi,Minangkabau,Macedonian,Maltese,Maori,Burmese,Nijadali,Dutch,Norwegian Bokmål,Nepali,Pedi,Chewa,Oriya,Panjabi,Southern Pashto,Iranian Persian,Plateau Malagasy,Polish,Portuguese,Quechua,Romanian,Russian,Shan,Sinhala,Slovak,Slovenian,Shona,Sindhi,Somali,Sotho,Spanish,Serbian,Swati,Sundanese,Swahili,Swedish,Swahili,Tamil,Telugu,Tajik,Tagalog,Thai,Tigrinya,Tswana,Tsonga,Turkish,Ukrainian,Urdu,Northern Uzbek,Vietnamese,Waray,Wolof,Xhosa,Yoruba,Chinese,Standard Malay,Zulu
Submission Number: 5047