Keywords: benchmarking, language resources, evaluation methodologies, datasets for low resource languages
Abstract: This survey provides the first systematic review of Arabic benchmarks for evaluating large language models (LLMs) with Arabic support. It presents an analysis of more than 40 Arabic evaluation benchmarks spanning NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. A taxonomy is proposed that organizes benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. The analysis reveals significant progress in Arabic benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. Three primary approaches to benchmark construction are examined, namely native, translated, and synthetically generated collections, and their trade-offs in authenticity, scale, and cost are discussed. This work serves as a comprehensive reference for Arabic NLP researchers, offering insights into benchmark methodologies, reproducibility standards, and evaluation metrics, along with recommendations for future development.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Survey, Arabic Benchmarks
Contribution Types: Surveys
Languages Studied: Arabic
Submission Number: 9724