XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation

ACL ARR 2025 May Submission1304 Authors

17 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Cross-lingual open-ended generation---i.e., generating responses in a desired language different from that of the user's query---is an important yet understudied problem. We introduce XL-AlpacaEval, a new benchmark for evaluating cross-lingual generation capabilities in Large Language Models (LLMs), and propose XL-Instruct, a high-quality synthetic data generation method. Fine-tuning with just 8K XL-Instruct-generated instructions significantly improves model performance, increasing the win rate against GPT-4o-Mini from 7.4% to 21.5%, and improving on several fine-grained quality metrics. Additionally, models fine-tuned on XL-Instruct exhibit strong zero-shot transfer to both English-only and multilingual generation tasks. Given its consistent gains across the board, we strongly recommend incorporating XL-Instruct into the post-training pipeline of future multilingual LLMs. To facilitate further research, we will publicly and freely release the XL-Instruct and XL-AlpacaEval datasets, which constitute two of the few cross-lingual resources currently available in the literature.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: synthetic data, cross-lingual, multilinguality, open-ended generation, LLMs
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: German, Portuguese, Hungarian, Lithuanian, Irish, Maltese, Chinese, Hindi, French, Finnish, Turkish
Submission Number: 1304