Keywords: benchmarking, language resources, evaluation methodologies, datasets for low resource languages
Abstract: This survey provides the first systematic review of Arabic benchmarks for evaluating large language models (LLMs) with Arabic support. It presents an analysis of more than 40 Arabic evaluation benchmarks spanning NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. A taxonomy is proposed that organizes benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. The analysis reveals significant progress in Arabic benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. Three primary approaches to benchmark construction are examined, namely native, translated, and synthetically generated collections, and their trade-offs in authenticity, scale, and cost are discussed. This work serves as a comprehensive reference for Arabic NLP researchers, offering insights into benchmark methodologies, reproducibility standards, and evaluation metrics, along with recommendations for future development.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Survey, Arabic Benchmarks
Contribution Types: Surveys
Languages Studied: Arabic
Submission Number: 9724