Isnad AI at IslamicEval 2025: A Rule-Based System for Identifying Religious Texts in LLM Outputs

Published: 11 Sept 2025, Last Modified: 18 Sept 2025
Venue: IslamicEval @ ArabicNLP 2025
License: CC BY 4.0
Keywords: Islamic Citation Identification, LLM Output Verification, Rule-Based System, Synthetic Data Generation, AraBERTv2, Token Classification
TL;DR: The Isnad AI system introduces a novel rule-based method for synthetically generating a large-scale training corpus to fine-tune an AraBERTv2 model for the precise, character-level identification of Quranic and Hadith citations in LLM outputs.
Abstract: This paper presents the Isnad AI system developed for the IslamicEval 2025 Shared Task 1A, which focuses on identifying character-level spans of Quranic verses (Ayahs) and Prophetic sayings (Hadiths) within Large Language Model (LLM) outputs. The task is formulated as a token classification problem and addressed with a fine-tuned AraBERTv2 model. The primary contribution is a novel rule-based data preprocessing and augmentation pipeline that systematically generates a large-scale, high-quality training corpus from raw religious texts. Comprehensive ablation studies show that this controlled synthetic data generation approach significantly outperforms both traditional database lookup methods and basic fine-tuning approaches. The system achieved an F1 score of 66.97% on the official test set, demonstrating the effectiveness of principled synthetic data generation for specialized religious text verification tasks. To support reproducibility and future research in Islamic citation detection, all code, generated datasets, and experimental resources are publicly available on GitHub and Hugging Face.
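As an illustration of the general idea, not the authors' actual pipeline, the sketch below shows how a rule-based generator might embed a known citation in a carrier sentence, record its character-level span, and project that span onto token-level BIO tags for a token classifier. The function names, the `{}` template convention, the `Ayah` label string, and the whitespace tokenization are all assumptions made for this example; a real AraBERTv2 setup would use the model's subword tokenizer and its offset mappings instead.

```python
import re
from typing import List, Tuple

# (start, end_exclusive, label); label names here are hypothetical.
Span = Tuple[int, int, str]


def make_example(template: str, citation: str, label: str) -> Tuple[str, Span]:
    """Insert `citation` at the first '{}' in `template`.

    Returns the generated text and the character span of the citation.
    """
    start = template.index("{}")
    text = template.replace("{}", citation, 1)
    return text, (start, start + len(citation), label)


def char_spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Project character spans onto whitespace tokens as BIO tags.

    A token is tagged only if it lies fully inside a span; tokens that
    overlap a span boundary are left as 'O' in this simplified sketch.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


if __name__ == "__main__":
    # Hypothetical carrier sentence wrapped around the Basmala.
    text, span = make_example(
        "قال تعالى: {} صدق الله العظيم",
        "بسم الله الرحمن الرحيم",
        "Ayah",
    )
    for token, tag in char_spans_to_bio(text, [span]):
        print(token, tag)
```

Because the span is recorded at generation time, the label is exact by construction, which is the usual appeal of synthetic data for span-level tasks. Recovering character-level predictions at inference time would run the same mapping in reverse, from tagged tokens back to character offsets.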
Submission Number: 8