Technical vs Cultural: Evaluating LLMs in Arabic

Published: 24 Nov 2025, Last Modified: 24 Nov 2025
Venue: 5th Muslims in ML Workshop, co-located with NeurIPS 2025
License: CC BY 4.0
Keywords: LLMs; Arabic NLP; cultural sensitivity; holistic evaluation
TL;DR: We evaluate 5 LLMs on Arabic tasks using a 4-dimensional framework, finding that frontier models excel in technical accuracy while specialized Arabic models like Fanar shine in cultural competency and linguistic fluency.
Abstract: We present a pilot evaluation framework for language models in Arabic, revealing nuanced performance patterns across technical and cultural dimensions. We evaluate five prominent models—Arabic-specialized systems (Fanar, Falcon 3) and frontier models (Claude Opus, GPT-5, Llama)—on a small set of 45 prompts spanning general knowledge, trust and safety, and mathematical reasoning. Using four-dimensional scoring, we find varied performance patterns. While Claude (and frontier models in general) excels in technical accuracy, Arabic-specialized models demonstrate competitive cultural context and language quality, with Fanar showing strong linguistic competency. Mathematical reasoning emerges as the primary technical differentiator, while cultural competency shows less variation between specialized and frontier models than initially hypothesized. These findings highlight the need for new assessment approaches as new models emerge, underscore the importance of balancing technical accuracy with cultural and linguistic fluency, and suggest that domain-specific optimization may be more effective than broad specialization.
Track: Track 1: ML on Islamic Content / ML for Muslim Communities
Submission Number: 53