Abstract: Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present ANON, a multilingual collection of high-quality monolingual and parallel corpora. The monolingual portion contains 8T tokens covering 193 languages, while the parallel portion contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide an extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on ANON, demonstrating its value.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, language resources, multilingual corpora, NLP datasets, datasets for low resource languages, reproducibility, LLM training
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, 192 other languages
Submission Number: 2456