Word Representation Models for Arabic Dialect Identification

Mahmoud Sobhy, Ahmed H. Abu El-Atta, Ahmed El-Sawy, Hamada Nayel

Published: 2022, Last Modified: 29 Apr 2024. Venue: WANLP@EMNLP 2022. License: CC BY-SA 4.0

Abstract: This paper describes the systems submitted by the BFCAI team to the Nuanced Arabic Dialect Identification (NADI) 2022 shared task. Dialect identification aims to automatically detect the source variant of a given text or speech segment. NADI 2022 has two subtasks: the first is country-level dialect identification and the second is sentiment analysis. Our team participated in the first subtask. The proposed systems use Term Frequency-Inverse Document Frequency (TF-IDF) and word embeddings as vectorization models, with different machine learning algorithms as classifiers. The proposed systems were tested on two test sets, Test-A and Test-B. The proposed models achieved macro-F1 scores of 21.25% and 9.71% on Test-A and Test-B, respectively. By comparison, the best-performing submitted system achieved macro-F1 scores of 36.48% and 18.95% on Test-A and Test-B, respectively.
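
To make the two technical ingredients named in the abstract concrete, the following is a minimal sketch of TF-IDF weighting and of the macro-F1 metric. It is not the authors' implementation: the function names `tfidf_vectors` and `macro_f1`, the whitespace tokenization, the unsmoothed `tf * log(N / df)` formula, and the country-code labels are all assumptions chosen for illustration. Library implementations typically use smoothed or normalized variants of the same idea.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per whitespace-tokenized document.

    Assumed variant: weight = (term count / document length) * log(N / df).
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy placeholder documents and hypothetical country labels.
    vectors = tfidf_vectors(["a b a", "b c", "c c d"])
    print(vectors[0])
    print(macro_f1(["EG", "EG", "SA", "MA"], ["EG", "SA", "SA", "EG"]))
```

Because macro-F1 averages the per-class scores without weighting by class size, a class that is never predicted correctly pulls the average down as much as any other class, which is one reason scores on a many-country task can be low even when frequent dialects are classified reasonably well.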