Benchmarking Generative AI on Quranic Knowledge

Published: 24 Nov 2025, Last Modified: 24 Nov 2025 · 5th Muslims in ML Workshop, co-located with NeurIPS 2025 · License: CC BY 4.0
Keywords: GenAI, Quranic QA, LLM Benchmarking
Abstract: This paper evaluates the performance of large language models (LLMs) and embedding-based retrieval systems in answering Quranic questions, a task demanding both semantic understanding and theological grounding. The Quran's complex rhetorical structure, contextual depth, and inter-verse coherence pose challenges for general-purpose models. To address this, we introduce a human-reviewed benchmark of 881 multiple-choice questions derived from 200 Quranic verses, stratified by five cognitive reasoning levels (using Bloom's Taxonomy) and four familiarity tiers based on verse perplexity. We assess model performance on two tasks: (1) multiple-choice QA (semantic comprehension), and (2) verse identification (reference grounding). Results show that instruction-tuned LLMs such as Fanar-1-9B achieve 41% accuracy on MCQs and 15.6% top-1 verse identification accuracy, with a marked decline from low-complexity ("Remember") to high-complexity ("Evaluate") questions. Conversely, a dense retriever achieves 45.1% top-5 accuracy and an MRR of 0.341, with particularly strong performance on familiar and low-level questions (e.g., 73% on "Remember", 57% on low-perplexity verses).
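The retrieval metrics reported above (top-k accuracy and mean reciprocal rank, MRR) follow their standard definitions. As a minimal sketch (not the paper's actual evaluation code; the function names and verse-ID format are illustrative assumptions), they can be computed as:

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """MRR: average of 1/rank of the first correct verse per query (0 if absent)."""
    reciprocal_ranks = []
    for candidates, correct in zip(ranked_lists, gold):
        if correct in candidates:
            reciprocal_ranks.append(1.0 / (candidates.index(correct) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

def top_k_accuracy(ranked_lists, gold, k):
    """Fraction of queries whose correct verse appears in the top-k candidates."""
    hits = sum(1 for candidates, correct in zip(ranked_lists, gold)
               if correct in candidates[:k])
    return hits / len(gold)

# Toy example with hypothetical verse IDs (surah:ayah):
ranked = [["2:255", "1:1", "112:1"],   # correct at rank 1
          ["1:1", "2:255", "112:1"]]   # correct at rank 2
gold = ["2:255", "2:255"]
print(mean_reciprocal_rank(ranked, gold))  # (1/1 + 1/2) / 2 = 0.75
print(top_k_accuracy(ranked, gold, k=1))   # 0.5
```

A top-1 score thus rewards only exact first-rank identification, while MRR gives partial credit for correct verses ranked lower, which is why the two numbers can diverge.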
Submission Number: 79