ArabKT: A Comprehensive Arabic Knowledge Evaluation Suite for Large Language Models

ACL ARR 2025 May Submission 1675 Authors

18 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · License: CC BY 4.0
Abstract: The evaluation of large language models (LLMs) is crucial for understanding their capabilities, yet current methods rely heavily on manually created benchmarks that cover only a narrow slice of domain-specific knowledge. To address this gap, we propose an automated approach to generating comprehensive evaluation data and introduce ArabKT, an Arab-world Knowledge Taxonomy derived from Wikipedia and Wikidata. ArabKT organizes $140,433$ categories and $1.67$ million articles into a $15$-layer tree structure, covering $77\%$ of the Arabic pre-training corpus and $84\%$ of existing Arabic benchmarks. Leveraging LLMs, we develop an automated pipeline that generates $6$ million question-answer pairs covering Arab-world knowledge. Our experiments reveal two key insights: (1) LLMs consistently struggle with religiously sensitive topics and cognitive conflicts, requiring further alignment and feedback from native speakers. (2) Larger models show no advantage on niche expertise and interdisciplinary domains, for which data acquisition takes priority over scaling model size. These findings provide statistical evidence and actionable guidance for improving LLMs in underexplored areas.
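To illustrate the kind of taxonomy extraction the abstract describes, the sketch below walks Arabic Wikipedia's category graph breadth-first via the public MediaWiki API and collects a category tree plus the articles under it. The root category, depth limit, and all function names are illustrative assumptions, not the authors' actual ArabKT pipeline.

```python
# Minimal sketch: breadth-first traversal of Arabic Wikipedia's category graph
# via the public MediaWiki API. Root category and depth limit are assumptions,
# not the authors' actual ArabKT construction pipeline.
from collections import deque
import requests

API = "https://ar.wikipedia.org/w/api.php"

def category_members(category: str, session: requests.Session):
    """Yield (title, is_subcategory) pairs for one category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "subcat|page",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = session.get(API, params=params, timeout=30).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"], member["ns"] == 14  # namespace 14 = Category
        if "continue" not in data:
            break
        params.update(data["continue"])

def build_taxonomy(root: str, max_depth: int = 15):
    """Collect a parent -> children category tree and the article titles beneath it."""
    session = requests.Session()
    tree, articles, seen = {}, set(), {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for title, is_subcat in category_members(category, session):
            if is_subcat and title not in seen:
                seen.add(title)
                tree.setdefault(category, []).append(title)
                queue.append((title, depth + 1))
            elif not is_subcat:
                articles.add(title)
    return tree, articles

if __name__ == "__main__":
    # Hypothetical root category ("Category:Arab world"); kept shallow for the demo.
    tree, articles = build_taxonomy("تصنيف:العالم العربي", max_depth=3)
    print(len(tree), "categories,", len(articles), "articles")
```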
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: datasets for low resource languages, automatic creation and evaluation of language resources, evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: Arabic
Submission Number: 1675