Keywords: Arabic-centric large language models, Low-resource and underrepresented languages, Arabic dialects and diglossia, Multilingual language modeling, Transformer-based language models, Cultural and linguistic alignment, Large-scale pretraining, Instruction tuning and preference alignment, Open-weight language models
Abstract: We present a family of Arabic-centric large language models representing the most capable and culturally aligned Arabic LLMs to date. The family includes the largest open Arabic-centric LLM trained from scratch, at 70B parameters, and a best-in-class 8B-parameter LLM. A custom Arabic-centric vocabulary enables efficient training and inference, and an optimized architecture and training recipe yield highly compute-efficient training. With a substantially smaller token budget than comparable models, our models achieve state-of-the-art Arabic performance and competitive English results. They rank first on key Arabic leaderboards, the Open Arabic Leaderboard v2 (OALLv2) and AraGen v1, and also lead on several benchmarks for domains deeply rooted in Arab life, such as poetry, religion, and dream interpretation, as well as on general tasks such as translation and summarization. We release both model sizes as open weights on HuggingFace under a commercially permissive license: http://anonymous.for.review. By uniting scale, linguistic diversity, cultural fidelity, openness, and speed, our models establish a transparent and inclusive foundation for the next generation of high-performance Arabic-centric LLMs.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: applications, pre-training, fine-tuning, continual learning, prompting, safety and alignment, scaling
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Arabic, English, Arabic Dialects
Submission Number: 9782