Abstract: In this work, we present **ALLaM**: **A**rabic **L**arge **La**nguage **M**odel, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained, considering the values of language alignment and transferability of knowledge at scale. The models are based on an autoregressive decoder-only architecture and are pretrained on a mixture of Arabic and English texts. We illustrate how second-language acquisition via vocabulary expansion can help steer a language model towards a new language without major catastrophic forgetting in English. Furthermore, we highlight the effectiveness of using translation data and examine the process of knowledge encoding within the language model's latent space. Finally, we show that effective alignment with human preferences can significantly enhance the performance of a large language model (LLM) compared to less-aligned models of a larger scale. Our methodology enables us to achieve state-of-the-art performance on various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve in both Arabic and English over their base models. Our model is openly available via [Redacted]().
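The abstract mentions vocabulary expansion as the mechanism for second-language acquisition. As a rough illustration only (not the authors' code), the sketch below shows the general pattern with the Hugging Face `transformers` API: extend an existing tokenizer with target-language tokens and resize the embedding matrix before continued pretraining. The checkpoint name and the handful of Arabic tokens are placeholders.

```python
# Minimal sketch of vocabulary expansion, assuming a Hugging Face causal LM.
# The checkpoint name and the tiny token list below are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "some-english-pretrained-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Add new subword tokens covering the target language (here, three common Arabic words).
new_arabic_tokens = ["ال", "من", "في"]
num_added = tokenizer.add_tokens(new_arabic_tokens)

# Resize the embedding matrix so the new tokens receive trainable embedding rows;
# continued pretraining on an Arabic/English mixture would then adapt them.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```

In practice the added vocabulary would be learned from a target-language corpus (e.g., by training a tokenizer on Arabic text and merging it with the base vocabulary) rather than listed by hand as above.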
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Large Language Model, English, Arabic, Second Language Acquisition
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English, Arabic
Submission Number: 3320