Cultural Benchmarking of LLMs in MSA and Arabic Dialectal Dialogue

ACL ARR 2026 January Submission10215 Authors

06 Jan 2026 (modified: 07 Jun 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Arabic, dialect, MSA, cultural reasoning, dialogue

Abstract: There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. In Arabic NLP, most prior work focuses on Modern Standard Arabic (MSA) and short text snippets, overlooking the cultural nuances that naturally arise in dialogue. To address this gap, we introduce a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and the corresponding dialects, spanning 12 daily-life domains and 54 fine-grained subtopics. We define three tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Experiments with open-weight LLMs reveal substantial challenges: on all three tasks, models perform significantly worse on dialectal data than on MSA, highlighting the need for culturally aware dialogue systems.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: corpus creation, benchmarking, language resources, NLP datasets

Contribution Types: Approaches to low-resource settings, Data resources

Languages Studied: Modern Standard Arabic (MSA); Arabic dialects of Algeria, Libya, Morocco, Tunisia, Egypt, Sudan, Jordan, Lebanon, Palestine, Syria, KSA, UAE, and Yemen

Submission Number: 10215
