Keywords: Multilingual LLM alignment, Multilingual IFT dataset, Multilingual multi-turn IFT dataset
TL;DR: A multilingual, multi-turn, evolved instruction finetuning dataset that achieves state-of-the-art results on several multilingual and multi-turn evaluation benchmarks.
Abstract: Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. Numerous effective IFT datasets have been proposed recently, but most focus on high-resource languages such as English. In this work, we propose a diverse, task-taxonomy-guided, fully synthetic multilingual, multi-turn, evolved instruction finetuning dataset, called M2Lingual, to better align LLMs on a diverse set of languages and tasks. M2Lingual contains a total of 182K IFT pairs built upon diverse seeds collected from the Aya Collection and Aya Dataset, covering 70 languages, 19 NLP tasks, and general instruction-response pairs. LLMs finetuned with M2Lingual substantially outperform those finetuned with the majority of existing multilingual IFT datasets. Importantly, LLMs trained with M2Lingual achieve consistently competitive results across a wide variety of evaluation benchmarks, whereas existing multilingual IFT datasets improve LLM performance on only one or a few of these benchmarks. Specifically, LLMs finetuned with M2Lingual achieve strong performance on multi-turn evaluation benchmarks such as MT-bench and across a wide variety of multilingual tasks such as XQuAD, MGSM, TyDiQA, MLQA, XNLI, and XLSUM. We show the efficacy of M2Lingual across LLMs of different sizes; in particular, smaller LLMs of 1.8B parameters benefit substantially from our dataset. Lastly, we present key analyses highlighting the importance of each synthesis step of M2Lingual.
Supplementary Material: zip
Submission Number: 2194