Abstract: Large Language Models (LLMs) serve as the foundation of contemporary artificial intelligence systems. Recently, a diverse range of Arabic-centric LLMs has emerged, designed to align with the values and preferences of Arabic speakers and offering advanced capabilities such as instruction following, open-ended question answering, and information delivery. In this paper, we identify the limitations of existing Arabic LLM benchmarks, which rely exclusively on multiple-choice questions and thereby fail to adequately assess the text generation capabilities of LLMs. To address this shortcoming, we propose a new automated evaluation benchmark, CamelEval, that performs LLM-as-judge evaluation. CamelEval comprises three test suites that evaluate general instruction following, factuality, and cultural alignment. Each test suite contains 805 carefully curated, challenging test cases that reflect the nuances of the Arabic language and culture. We envision CamelEval as a tool to guide the development of future Arabic LLMs, serving over 400 million Arabic speakers by providing LLMs that not only communicate in their language but also understand their culture.
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, evaluation
Contribution Types: Approaches to low-resource settings
Languages Studied: Arabic
Submission Number: 848