A Comprehensive Framework and Empirical Analysis for Evaluating Large Language Models in Arabic Dialect Identification
Abstract: The widespread interest in large language models (LLMs) is rooted in their remarkable capacity to generate human-like, contextually relevant responses. However, the accuracy of LLMs on specific domains or intricate tasks, such as Arabic dialect identification, remains largely unexplored. This task poses a substantial challenge in Arabic natural language processing given its language-dependent nature. This paper presents a framework for evaluating LLMs on Arabic dialect identification and conducts a comprehensive evaluation under both tuning-free and fine-tuning learning paradigms. Under the tuning-free paradigm, the evaluation covers GPT-3.5, ChatGPT, GPT-4, and Google Bard; under the fine-tuning paradigm, it assesses GPT-3.5 alongside AraBERT and MARBERT. In the tuning-free setting, GPT-4 achieves the best results, with a macro-averaged F1 (F1-MAC) of 45.60%. Under the fine-tuning paradigm, AraBERT and MARBERT perform comparably to GPT-3.5 (around 50% F1-MAC) without incurring the financial costs associated with GPT-3.5.
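The F1-MAC score cited above averages per-dialect F1 scores with equal weight per class, so infrequent dialects count as much as common ones. The sketch below illustrates how such a score is typically computed with scikit-learn; the dialect label set and the predictions are illustrative placeholders and are not taken from the paper.

```python
# Minimal sketch of macro-F1 scoring for a dialect-identification run.
# The dialect labels and predictions below are illustrative placeholders,
# not data or results from the paper.
from sklearn.metrics import f1_score, classification_report

DIALECTS = ["Egyptian", "Gulf", "Levantine", "Maghrebi", "MSA"]  # assumed label set

# Gold labels and model predictions for a handful of test sentences.
y_true = ["Egyptian", "Gulf", "MSA", "Levantine", "Maghrebi", "Egyptian"]
y_pred = ["Egyptian", "Levantine", "MSA", "Levantine", "Gulf", "Egyptian"]

# Macro averaging computes F1 per dialect, then takes the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, labels=DIALECTS, average="macro", zero_division=0)
print(f"F1-MAC: {macro_f1:.2%}")

# Per-dialect breakdown, useful for spotting which dialects a model confuses.
print(classification_report(y_true, y_pred, labels=DIALECTS, zero_division=0))
```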