ALLaM: Large Language Models for Arabic and English

M Saiful Bari; Yazeed Alnumay; Norah A. Alzahrani; Nouf M. Alotaibi; Hisham Abdullah Alyahya; Sultan AlRashed; Faisal Abdulrahman Mirza; Shaykhah Z. Alsubaie; Hassan A. Alahmed; Ghadah Alabduljabbar; Raghad Alkhathran; Yousef Almushayqih; Raneem Alnajim; Salman Alsubaihi; Maryam Al Mansour; Saad Amin Hassan; Dr. Majed Alrubaian; Ali Alammari; Zaki Alawami; Abdulmohsen Al-Thubaity; Ahmed Abdelali; Jeril Kuriakose; Abdalghani Abujabal; Nora Al-Twairesh; Areeb Alowisheq; Haidar Khan

Published: 22 Jan 2025, Last Modified: 14 Mar 2025. Venue: ICLR 2025 Poster. License: CC BY 4.0

Keywords: Large Language Model, English, Arabic, Second Language Acquisition

TL;DR: We trained a foundational Arabic LLM by leveraging English LLMs.

Abstract: In this work, we present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained with attention to language alignment and the transferability of knowledge at scale. The models are based on an autoregressive decoder-only architecture and are pretrained on a mixture of Arabic and English texts. We illustrate how second-language acquisition via vocabulary expansion can help steer a language model towards a new language without major catastrophic forgetting in English. Furthermore, we highlight the effectiveness of using translation data and the process of knowledge encoding within the language model's latent space. Finally, we show that effective alignment with human preferences can significantly enhance the performance of a large language model (LLM) compared to less aligned models of a larger scale. Our methodology enables us to achieve state-of-the-art performance on various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve in both Arabic and English over their base aligned models.
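As a rough illustration of what vocabulary expansion involves, the toy sketch below appends new tokens to an existing vocabulary and grows the embedding table by one row per new token. It is not the ALLaM procedure: the function name `expand_vocabulary`, the mean-of-existing-rows initialization (a common heuristic), and the tiny example data are all assumptions made for illustration, and the sketch assumes token ids are contiguous from 0.

```python
from typing import Dict, List, Tuple


def expand_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    new_tokens: List[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Return copies of vocab and embeddings extended with unseen tokens.

    Hypothetical helper: new rows are initialized to the mean of the
    existing rows, which is an assumed heuristic, not a documented
    ALLaM design choice.
    """
    dim = len(embeddings[0])
    count = len(embeddings)
    # Column-wise mean of the existing embedding rows.
    mean_row = [sum(row[j] for row in embeddings) / count for j in range(dim)]

    new_vocab = dict(vocab)
    new_embeddings = [list(row) for row in embeddings]
    for token in new_tokens:
        if token in new_vocab:
            continue  # existing tokens keep their id and embedding
        new_vocab[token] = len(new_embeddings)
        new_embeddings.append(list(mean_row))
    return new_vocab, new_embeddings


# Toy data: a two-token English vocabulary with 2-dimensional embeddings.
vocab = {"the": 0, "model": 1}
embeddings = [[1.0, 0.0], [0.0, 1.0]]
new_vocab, new_embeddings = expand_vocabulary(
    vocab, embeddings, ["لغة", "model", "نموذج"]
)
```

Because the original rows are left unchanged, the existing English tokens keep their learned representations at the moment of expansion; only continued pretraining on the mixed corpus would then adapt the new rows.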

Primary Area: foundation or frontier models, including LLMs

Submission Number: 12255
