ArabKT: A Comprehensive Arabic Knowledge Evaluation Suite for Large Language Models

ACL ARR 2025 May Submission 1675 Authors

18 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · License: CC BY 4.0
Abstract: The evaluation of large language models (LLMs) is crucial for understanding their capabilities, yet current methods rely heavily on manually created benchmarks that cover only a narrow slice of domain-specific knowledge. To address this gap, we propose an automated approach to generating comprehensive evaluation data and introduce ArabKT, an Arab-world Knowledge Taxonomy derived from Wikipedia and Wikidata. ArabKT organizes $140,433$ categories and $1.67$ million articles into a $15$-layer tree structure, covering $77\%$ of the Arabic pre-training corpus and $84\%$ of existing Arabic benchmarks. Leveraging LLMs, we develop an automated pipeline that generates $6$ million question-answer pairs covering Arab-world knowledge. Our experiments reveal two key insights: (1) LLMs consistently struggle with religiously sensitive topics and cognitive conflicts, requiring further alignment and feedback from native speakers. (2) Larger models show no advantage on niche expertise and interdisciplinary domains, for which data acquisition takes priority over scaling model size. These findings provide statistical evidence and actionable guidance for improving LLMs in underexplored areas.
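To illustrate the kind of taxonomy extraction the abstract describes, the sketch below walks Arabic Wikipedia's category graph breadth-first via the public MediaWiki API and collects a category tree plus the articles under it. The root category, depth limit, and all function names are illustrative assumptions, not the authors' actual ArabKT pipeline.

```python
# Minimal sketch: breadth-first traversal of Arabic Wikipedia's category graph
# via the public MediaWiki API. Root category and depth limit are assumptions,
# not the authors' actual ArabKT construction pipeline.
from collections import deque
import requests

API = "https://ar.wikipedia.org/w/api.php"

def category_members(category: str, session: requests.Session):
    """Yield (title, is_subcategory) pairs for one category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "subcat|page",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = session.get(API, params=params, timeout=30).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"], member["ns"] == 14  # namespace 14 = Category
        if "continue" not in data:
            break
        params.update(data["continue"])

def build_taxonomy(root: str, max_depth: int = 15):
    """Collect a parent -> children category tree and the article titles beneath it."""
    session = requests.Session()
    tree, articles, seen = {}, set(), {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for title, is_subcat in category_members(category, session):
            if is_subcat and title not in seen:
                seen.add(title)
                tree.setdefault(category, []).append(title)
                queue.append((title, depth + 1))
            elif not is_subcat:
                articles.add(title)
    return tree, articles

if __name__ == "__main__":
    # Hypothetical root category ("Category:Arab world"); kept shallow for the demo.
    tree, articles = build_taxonomy("تصنيف:العالم العربي", max_depth=3)
    print(len(tree), "categories,", len(articles), "articles")
```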
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: datasets for low resource languages, automatic creation and evaluation of language resources, evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: Arabic
Submission Number: 1675