LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish

ACL ARR 2025 May Submission 5520 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Instruction tuning has become a key technique for enhancing the performance of large language models, enabling them to better follow human prompts. However, low-resource languages such as Luxembourgish face severe limitations due to the lack of high-quality instruction datasets. Traditional reliance on machine translation often introduces semantic misalignment and cultural inaccuracies. In this work, we address these challenges by creating a cross-lingual instruction tuning dataset for Luxembourgish without the use of machine translation. Instead, by leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances. We provide evidence that this cross-lingual approach not only circumvents common translation pitfalls but also leads to higher cross-lingual alignment within LLMs. This alignment is essential for enabling effective transfer to low-resource languages such as Luxembourgish. Therefore, our results advocate for data curation strategies that prioritize linguistic integrity over automated translation.
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: less-resourced languages, resources for less-resourced languages, multilingual representations
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Luxembourgish, French, German, English
Submission Number: 5520