Keywords: low-resource NLP, Modern Standard Arabic, North African Arabic dialects, language modeling, transfer learning, vocabulary efficiency, computational linguistics
TL;DR: We aim to develop language models for low-resource languages, particularly North African Arabic dialects, using only the formal-language corpora that are already available.
Abstract: Arabic dialects present major challenges for natural language processing (NLP) due to their diglossic nature, phonetic variability, and the scarcity of resources. To address this, we introduce a phoneme-like transcription approach that enables the training of robust language models for North African dialects (NADs) using only formal language data, without the need for dialect-specific corpora.
Our key insight is that Arabic dialects are highly phonetic, with NADs particularly influenced by European languages. This motivated us to develop a novel approach in which we convert Arabic script into a Latin-based representation, allowing our language model, ABDUL, to benefit from existing Latin-script corpora.
Our method demonstrates strong performance in multi-label emotion classification and named entity recognition (NER) across various Arabic dialects. ABDUL achieves results comparable to or better than specialized and multilingual models such as DarijaBERT, DziriBERT, and mBERT. Notably, in the NER task, ABDUL outperforms mBERT by 5% in F1-score for Modern Standard Arabic (MSA), Moroccan, and Algerian Arabic, despite using a vocabulary four times smaller than mBERT's.
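To illustrate the script conversion the abstract describes, the following is a minimal sketch of mapping Arabic script to a Latin-based representation. The mapping table `ARABIC_TO_LATIN` and the function `romanize` are hypothetical names introduced here for illustration. The table is a simplified, lossy letter-by-letter scheme and should not be read as the phoneme-like transcription used to train ABDUL, which the abstract does not specify.

```python
# Hypothetical, simplified Arabic-to-Latin letter mapping (an assumption for
# illustration only; NOT the transcription scheme used by ABDUL).
ARABIC_TO_LATIN = {
    "\u0627": "a",   # alef
    "\u0628": "b",   # beh
    "\u062a": "t",   # teh
    "\u062b": "th",  # theh
    "\u062c": "j",   # jeem
    "\u062d": "h",   # hah
    "\u062e": "kh",  # khah
    "\u062f": "d",   # dal
    "\u0630": "dh",  # thal
    "\u0631": "r",   # reh
    "\u0632": "z",   # zain
    "\u0633": "s",   # seen
    "\u0634": "sh",  # sheen
    "\u0635": "s",   # sad (emphatic distinction is lost in this sketch)
    "\u0636": "d",   # dad (emphatic distinction is lost in this sketch)
    "\u0637": "t",   # tah (emphatic distinction is lost in this sketch)
    "\u0638": "z",   # zah (emphatic distinction is lost in this sketch)
    "\u0639": "a",   # ain (approximated as a vowel)
    "\u063a": "gh",  # ghain
    "\u0641": "f",   # feh
    "\u0642": "q",   # qaf
    "\u0643": "k",   # kaf
    "\u0644": "l",   # lam
    "\u0645": "m",   # meem
    "\u0646": "n",   # noon
    "\u0647": "h",   # heh
    "\u0648": "w",   # waw
    "\u064a": "y",   # yeh
    "\u0629": "a",   # teh marbuta
    "\u0649": "a",   # alef maksura
}


def romanize(text: str) -> str:
    """Convert Arabic-script characters to Latin using the table above.

    Short-vowel diacritics (U+064B..U+0652) and tatweel (U+0640) are dropped;
    any character without a mapping is passed through unchanged.
    """
    pieces = []
    for ch in text:
        if "\u064b" <= ch <= "\u0652" or ch == "\u0640":
            continue
        pieces.append(ARABIC_TO_LATIN.get(ch, ch))
    return "".join(pieces)


if __name__ == "__main__":
    # kaf, teh, alef, beh ("book")
    print(romanize("\u0643\u062a\u0627\u0628"))
```

Under this toy mapping the example word is rendered as `ktab`. A real phoneme-like transcription would also need to handle vowels, emphatic consonants, and dialect-specific pronunciation, none of which this sketch attempts.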
Archival: Archival Track
Participation: Virtual
Presenter: Yassine Toughrai
Submission Number: 7