Can LLMs Compute Zakat? A Symbolic Islamic Finance Benchmark for Cross-Lingual Islamic Finance

Mukhammed Togmanov; Fajri Koto

Published: 14 Jun 2026, Last Modified: 21 Jun 2026 | ICML 2026 Workshop MusIML Poster | CC BY 4.0

Keywords: Islamic finance, Shariah, symbolic reasoning, cross-lingual benchmark, LLM evaluation, verifiable reasoning, zakat, faraid, sukuk, low-resource languages, cultural inclusion, multilingual NLP

TL;DR: First multilingual symbolic benchmark for Shariah-grounded Islamic finance reasoning: math-specialised 7B models outperform comparable-scale general-purpose open-weight models but trail frontier models, and boolean compliance predicates fall below chance.

Abstract: We introduce a verifiable, cross-lingual symbolic benchmark for evaluating large language models (LLMs) on rule-bound Islamic finance reasoning. The benchmark comprises 129 expert-validated templates grounded in formally specified AAOIFI Shariah rules, covering six operation categories — zakat (50), Islamic inheritance faraid (31), sukuk and ETB pricing (21), ijara leasing (16), istisna contracts (9), and murabaha financing (2) — and is realised through stratified parameter sampling into 6,450 English instances and 38,700 total cross-lingual instances across English, Arabic, Bahasa Indonesia, Urdu, Hindi, and Kazakh. Each instance ships with an executable verifier, enabling exact step-level scoring and ruling out the contamination concerns endemic to static benchmarks. Evaluating seven LLMs zero-shot, we find that frontier proprietary models lead overall (GPT-5.1: 61.8% FAC, Claude Sonnet 4.5: 60.9%, DeepSeek-V3.2: 58.4%), while math-specialised 7B models (MetaMath-7B: 57.1%, WizardMath-7B: 56.4%) outperform comparable-scale general-purpose open-weight models by 1.3–3.3 pp but remain below the frontier tier. Boolean predicate evaluation collapses below the 50% chance baseline (mean 23.4% FAC) across all models, and domain-specific errors — Hijri/Gregorian calendar conflation, nisab threshold confusion, mis-assigned heir shares — dominate the failure profile. The benchmark provides the first reproducible measurement of formal Shariah reasoning in multilingual LLMs.
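To make the template, sampling, and verifier pipeline described in the abstract concrete, the following is a minimal sketch of how one zakat template might be instantiated and scored. It is not the authors' implementation. The class and function names (`ZakatInstance`, `sample_instance`, `reference_answer`, `verify`), the two sampling strata, the 85 g gold nisab, and the 2.5% rate are illustrative assumptions (the last two are commonly cited values, not figures taken from the benchmark). The figure of 50 instances per template is inferred from 6,450 / 129 rather than stated in the abstract.

```python
import random
from dataclasses import dataclass

# Illustrative assumptions, not taken from the benchmark itself.
NISAB_GOLD_GRAMS = 85   # commonly cited gold nisab, in grams
ZAKAT_RATE = 0.025      # commonly cited rate for one lunar year

# Template counts per category, as reported in the abstract.
TEMPLATES = {"zakat": 50, "faraid": 31, "sukuk_etb": 21,
             "ijara": 16, "istisna": 9, "murabaha": 2}
INSTANCES_PER_TEMPLATE = 50  # inferred: 6,450 / 129
LANGUAGES = 6

@dataclass(frozen=True)
class ZakatInstance:
    cash: int                 # zakatable cash held for a full lunar year
    debts: int                # debts due, deducted before the nisab test
    gold_price_per_gram: int  # used to convert the nisab into currency

    def prompt(self) -> str:
        return (f"A person holds {self.cash} in cash for one lunar year and "
                f"owes {self.debts}. Gold costs {self.gold_price_per_gram} "
                f"per gram. How much zakat is due?")

def sample_instance(rng: random.Random, above_nisab: bool) -> ZakatInstance:
    """Draw parameters from one stratum: net wealth above or below nisab."""
    price = rng.randint(50, 100)
    nisab = NISAB_GOLD_GRAMS * price
    debts = rng.randint(0, 5000)
    if above_nisab:
        net = nisab + rng.randint(0, 50000)
    else:
        net = rng.randint(0, nisab - 1)
    return ZakatInstance(cash=net + debts, debts=debts,
                         gold_price_per_gram=price)

def reference_answer(inst: ZakatInstance) -> float:
    """Compute the ground-truth zakat under the assumed rule."""
    net = inst.cash - inst.debts
    nisab = NISAB_GOLD_GRAMS * inst.gold_price_per_gram
    return round(net * ZAKAT_RATE, 2) if net >= nisab else 0.0

def verify(inst: ZakatInstance, model_answer: float, tol: float = 0.01) -> bool:
    """Executable check: accept answers within a small numeric tolerance."""
    return abs(model_answer - reference_answer(inst)) <= tol

if __name__ == "__main__":
    rng = random.Random(0)
    inst = sample_instance(rng, above_nisab=True)
    print(inst.prompt())
    print("reference:", reference_answer(inst))
    total_templates = sum(TEMPLATES.values())
    print(total_templates, total_templates * INSTANCES_PER_TEMPLATE,
          total_templates * INSTANCES_PER_TEMPLATE * LANGUAGES)
```

Because the reference answer is recomputed from the sampled parameters, fresh instances can be drawn at evaluation time, which is the property the abstract relies on when it contrasts the benchmark with static test sets. Sampling both sides of the nisab threshold is one plausible way to stratify. The abstract does not specify which strata the authors use.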

Track: Track 1: ML Research Addressing Challenges Faced by Muslim Communities

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.

Submission Number: 46
