sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting

ACL ARR 2025 February Submission 2897 Authors

15 Feb 2025 (modified: 09 May 2025), CC BY 4.0
Abstract: Despite the remarkable success of large language models (LLMs) in English, a significant performance gap remains in non-English languages. To address this, we introduce a novel approach for strategically constructing a multilingual synthetic instruction tuning dataset, sPhinX. Unlike prior methods that directly translate fixed instruction-response pairs, sPhinX enhances diversity by selectively augmenting English instruction-response pairs with multilingual translations. Additionally, we propose LANGIT, a novel N-shot guided fine-tuning strategy, which further enhances model performance by incorporating contextually relevant examples in each training sample. Our ablation study shows that our approach enhances the multilingual capabilities of Mistral-7B and Phi-3-Small, improving performance by an average of 39.8% and 11.2%, respectively, across multilingual benchmarks in reasoning, question answering, reading comprehension, and machine translation. Moreover, sPhinX maintains strong performance on English LLM benchmarks while exhibiting minimal to no catastrophic forgetting, even when trained on 51 languages.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: mixed language, multilingualism, cross-lingual transfer, multilingual pre-training, multilingual evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English, Spanish, Chinese Simplified, Japanese, French, German, Portuguese, Italian, Dutch, Swedish, Danish, Finnish, Russian, Norwegian, Korean, Chinese Traditional, Polish, Turkish, Arabic, Hebrew, Portuguese, Czech, Hungarian, Indonesian, Thai, Greek, Slovak, Vietnamese, Slovenian, Croatian, Romanian, Lithuanian, Bulgarian, Serbian, Latvian, Ukrainian, Estonian, Hindi, Burmese, Bengali, Afrikaans, Punjabi, Welsh, Icelandic, Marathi, Swahili, Nepali, Urdu, Telugu, Malayalam, Russian, Tamil, Oriya
Submission Number: 2897