BurhanAI at IslamicEval 2025 Shared Task: Combating Hallucinations in LLMs for Islamic Content; Evaluation, Correction, and Retrieval-Based Solution

Arij Al Adel; Abu Bakr Soliman; Mohamed Sakher Sawan; Rahaf Al-Najjar; Sameh Amin

Published: 11 Sept 2025, Last Modified: 22 Sept 2025. Venue: IslamicEval @ ArabicNLP 2025. License: CC BY 4.0

Keywords: RAG, Agent, GPT, LLMs, Fine-tuned Model, Search Engine, Hallucination Detection, Hallucination Correction, Span Detection, Validation, Correction, Hierarchical Indexing, Semantic Search, Vector Embeddings, Diacritic Normalization, Arabic, Quran, Hadith, Islamic Content, Low-resource Settings

TL;DR: A 4-page paper with an appendix, 4 tables, and one figure.

Abstract: In this paper, we describe our submission to the IslamicEval 2025 shared task, covering hallucination detection and correction and closed-world retrieval over Quranic and Hadith texts. We fine-tuned an LLM to detect Quran and Hadith text spans, using synthetic augmentation, diacritic variation, and morphological normalization to improve detection robustness (F1 = 87.10%), and we also used a separate reasoning model with tools (F1 = 90.06%). For validation, accuracy is 88.60%. For correction, accuracy is 66.56%, obtained with a layered hierarchical index and a search algorithm that combines exact, normalized, fuzzy, and semantic matching with prompt-driven repair to ensure canonical alignment and diacritic fidelity. For the correction stage, we also used a reasoning model with access to tools, which reached an accuracy of 61.04%. For the ranked answer-bearing text retrieval task, we implemented a Retrieval-Augmented Generation (RAG) system restricted to the corpora provided by the shared task, with structured output, vector-store grounding, and prompts tuned for “answer-enclosing” citations; it achieves a MAP@10 of 0.6199 on the development set and 0.2807 on the test set. The results highlight the value of normalization, corpus-restricted search, and reasoning models with tools in mitigating hallucinations and improving retrieval precision in low-resource religious settings. They also show that much smaller fine-tuned models can compete with frontier models (e.g., GPT-5 high) on specialized tasks such as span detection.
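
To make the layered matching idea concrete, the following is a minimal sketch of a lookup that tries exact, then diacritic-normalized, then fuzzy matching against a closed corpus. It is an illustration under stated assumptions, not the authors' implementation: the function names `normalize` and `layered_lookup`, the set of characters stripped, the alef unification rule, and the `fuzzy_cutoff` value are all hypothetical choices. The semantic (embedding-based) layer and the prompt-driven repair step are omitted because they depend on models the abstract does not specify.

```python
import difflib
import re
import unicodedata

# Assumed normalization set: Arabic harakat U+064B..U+0652,
# superscript alef U+0670, and tatweel U+0640.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Assumed rule: map alef variants (madda, hamza above/below, wasla) to bare alef.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625\u0671]")


def normalize(text: str) -> str:
    """Strip diacritics, unify alef forms, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    return " ".join(text.split())


def layered_lookup(query: str, corpus: list[str], fuzzy_cutoff: float = 0.85):
    """Return (layer, canonical_text) from the first layer that matches.

    Layers are tried from strictest to loosest, so a hit always maps back
    to the canonical (fully diacritized) corpus entry.
    """
    # Layer 1: exact string match against the canonical text.
    if query in corpus:
        return ("exact", query)

    # Layer 2: match after normalization on both sides.
    by_norm = {normalize(entry): entry for entry in corpus}
    norm_query = normalize(query)
    if norm_query in by_norm:
        return ("normalized", by_norm[norm_query])

    # Layer 3: fuzzy match on normalized text (difflib similarity ratio).
    close = difflib.get_close_matches(
        norm_query, list(by_norm), n=1, cutoff=fuzzy_cutoff
    )
    if close:
        return ("fuzzy", by_norm[close[0]])

    # A semantic (embedding) layer would go here in a fuller pipeline.
    return ("none", None)
```

Ordering the layers from strict to loose means a span that already matches the canonical text is never altered, while a span with missing or variant diacritics is still resolved to the canonical entry, which is one way to keep corrections aligned with the source corpus.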

Submission Number: 3
