Keywords: Arabic-centric large language models, Low-resource and underrepresented languages, Arabic dialects and diglossia, Multilingual language modeling, Transformer-based language models, Cultural and linguistic alignment, Large-scale pretraining, Instruction tuning and preference alignment, Open-weight language models
Abstract: We present a family of Arabic-centric large language models representing the most capable and culturally aligned Arabic LLMs to date. The family includes the largest open Arabic-centric LLM trained from scratch, at 70B parameters, and a best-in-class 8B-parameter LLM. A custom Arabic-centric vocabulary enables efficient training and inference, and an optimized architecture and training recipe yield highly compute-efficient training. With a substantially smaller token budget than comparable models, our models achieve state-of-the-art Arabic performance and competitive English results. They rank first on key Arabic leaderboards, the Open Arabic Leaderboard v2 (OALLv2) and AraGen v1, and also lead on several benchmarks for domains deeply rooted in Arab life, such as poetry, religion, and dream interpretation, as well as on general tasks such as translation and summarization. We release both model sizes as open weights on HuggingFace under a commercially permissive license: http://anonymous.for.review. By uniting scale, linguistic diversity, cultural fidelity, openness, and speed, our models establish a transparent and inclusive foundation for the next generation of high-performance Arabic-centric LLMs.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: applications, pre-training, fine-tuning, continual learning, prompting, safety and alignment, scaling
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Arabic, English, Arabic Dialects
Submission Number: 9782