ArabicKT: A Comprehensive Arabic Knowledge Evaluation Suite for Large Language Models

ACL ARR 2025 February Submission8399 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: The evaluation of large language models (LLMs) is crucial for understanding their capabilities, yet current methods rely heavily on manually created benchmarks that cover only a small fraction of specific knowledge. To address this gap, we propose an automated approach to generating comprehensive evaluation data and introduce ArabicKT, an Arabic Knowledge Taxonomy derived from Wikipedia and Wikidata. ArabicKT organizes $140,433$ categories and $1.67$ million articles into a $15$-layer tree structure, covering $77\%$ of the Arabic pre-training corpus and $84\%$ of existing Arabic benchmarks. Leveraging LLMs, we developed an automated pipeline that generates $6$ million question-answer pairs covering Arab-world knowledge. Our experiments reveal two key insights: (1) models perform better on knowledge points that appear more frequently in training data, and (2) larger models exhibit superior mastery of fine-grained cultural, religious, and historical knowledge. These findings indicate the importance of training data distribution and model scale in domain-specific knowledge acquisition, offering actionable guidance for improving LLMs in underexplored areas.
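The taxonomy described above could, in principle, be represented as a depth-capped category tree whose leaves hold articles, with knowledge points enumerated by traversal before QA generation. The sketch below is a minimal illustration of that idea; the `Category` class, `knowledge_points` function, and the example categories are hypothetical and not part of the paper's actual pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    """A node in a hypothetical Wikipedia-style category tree."""
    name: str
    depth: int
    children: list = field(default_factory=list)
    articles: list = field(default__factory=list) if False else field(default_factory=list)

def add_child(parent: Category, name: str) -> Category:
    child = Category(name, parent.depth + 1)
    parent.children.append(child)
    return child

def knowledge_points(root: Category, max_depth: int = 15):
    """Depth-first traversal yielding (category path, article) pairs,
    capped at the taxonomy's stated maximum depth of 15 layers."""
    stack = [(root, [root.name])]
    while stack:
        node, path = stack.pop()
        for article in node.articles:
            yield ("/".join(path), article)
        if node.depth < max_depth:
            for child in node.children:
                stack.append((child, path + [child.name]))

# Toy example: one branch of a taxonomy rooted at "Arab world".
root = Category("Arab world", depth=1)
history = add_child(root, "History")
history.articles.append("Umayyad Caliphate")
print(list(knowledge_points(root)))
```

Each yielded `(path, article)` pair would serve as one knowledge point from which an LLM could be prompted to generate question-answer pairs, so the number of QA pairs scales with the number of enumerated points.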
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, Knowledge System, Arabic Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 8399