BIICK-Bench: A Bengali Benchmark for Introductory Islamic Creed Knowledge in Large Language Models

Published: 24 Nov 2025, Last Modified: 24 Nov 2025 · 5th Muslims in ML Workshop, co-located with NeurIPS 2025 · CC BY 4.0
Keywords: Large Language Models, Benchmark, Evaluation, Bengali, Islamic Knowledge, Multilingual NLP, AI Safety
TL;DR: This paper introduces a new benchmark for Islamic knowledge in the Bengali language and reveals that most state-of-the-art LLMs perform poorly, highlighting a critical gap in non-English, specialized domains.
Abstract: Large Language Models (LLMs) are increasingly used as information sources globally, yet their proficiency in specialized domains for non-English speakers remains critically under-evaluated. This paper introduces the Bengali Introductory Islamic Creed Knowledge Benchmark (BIICK-Bench), a novel, 50-question multiple-choice benchmark in the Bengali language, designed to assess the foundational Islamic knowledge of LLMs. Crucially, this work is an evaluation of knowledge retrieval and does not endorse seeking religious verdicts (fatwas) from LLMs, a role that must remain with qualified human scholars. Addressing the digital language divide, BIICK-Bench provides a vital tool for the world's second-largest Muslim linguistic community. Fourteen prominent open-source LLMs were evaluated, ranging from 2.5B to 8B parameters. The fully automated evaluation reveals a stark performance disparity, with accuracy scores ranging from 0% to a high of 64%. The results underscore that even state-of-the-art models struggle with Bengali Islamic knowledge, highlighting the urgent need for culturally and linguistically specific benchmarks to ensure the safe and reliable use of AI in diverse communities.
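
The abstract describes a fully automated multiple-choice evaluation over 50 Bengali questions. The sketch below shows one way such a loop might be implemented; the dataset path, JSON schema, model name, and prompt wording are illustrative assumptions and are not taken from the paper.

```python
import json
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical dataset file: one JSON object per line, with a Bengali question,
# options labelled A-D, and a gold answer letter, e.g.
# {"question": "...", "options": {"A": "...", "B": "...", ...}, "answer": "B"}
QUESTIONS_PATH = "biick_bench.jsonl"          # assumed file name
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"       # placeholder open-source model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")


def ask(question: str, options: dict) -> str:
    """Prompt the model with the question and options, return the first option letter it emits."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    prompt += "\nAnswer with a single letter (A, B, C, or D):"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    # Decode only the newly generated tokens and extract the chosen letter.
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    match = re.search(r"[ABCD]", completion)
    return match.group(0) if match else ""


items = [json.loads(line) for line in open(QUESTIONS_PATH, encoding="utf-8")]
correct = sum(ask(it["question"], it["options"]) == it["answer"] for it in items)
print(f"Accuracy: {correct / len(items):.0%}")
```

A loop of this shape makes accuracy comparable across models, since every model sees the same prompts and its answer is scored by exact match against the gold option letter.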
Track: Track 1: ML on Islamic Content / ML for Muslim Communities
Submission Number: 23