AH-Translit: A Multi-Domain Dataset and Benchmark for Arabic-to-Hindi Transliteration

Published: 24 Nov 2025, Last Modified: 24 Nov 2025
Venue: 5th Muslims in ML Workshop co-located with NeurIPS 2025
License: CC BY 4.0
Keywords: transliteration, arabic-to-hindi, dataset, benchmark
TL;DR: We built *AH-Translit*, the first large-scale dataset for Arabic-to-Hindi transliteration, and used it to train and benchmark models, demonstrating the linguistic diversity and difficulty of the benchmark.
Abstract: The lack of public data for Arabic-to-Hindi transliteration has hindered the development of systems that can handle the languages' diverse linguistic styles. To address this, we introduce AH-Translit, a multi-domain dataset of 100K parallel pairs with over 1.2M Arabic and 1.5M Hindi words. We also present AH-Translit-Bench (benchmark data available at https://india-data.org/dataset-details/759e2466-b6d4-460a-a1fe-61207e885b1f), a balanced, human-verified benchmark for fair evaluation across diverse linguistic domains. Our analysis reveals that domain-specific models, while strong in-domain, generalize poorly. We show that a single model trained on a balanced mixture performs more consistently across all domains. This approach establishes a strong baseline with a Macro-averaged Character Error Rate (MaCER) of 15.7%. We release the benchmark and an evaluation package (https://pypi.org/project/AH-Translit-Bench/) for reproducible, cross-domain assessment.
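To make the MaCER metric concrete, the following is a minimal sketch of one plausible reading: compute a character error rate per domain (total edit distance divided by total reference length) and then take the unweighted mean over domains. This is an assumption, not the paper's confirmed definition or the API of the released evaluation package; the function names `edit_distance` and `macro_cer` and the domain labels in the usage lines are hypothetical. Characters are counted as Unicode code points, which may differ from the paper's tokenization of Devanagari.

```python
from collections import defaultdict


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def macro_cer(samples):
    """Return (macro-averaged CER, per-domain CER).

    samples: iterable of (domain, reference, hypothesis) triples.
    Assumption: per-domain CER is corpus-level, i.e. summed edits
    divided by summed reference length within each domain.
    """
    errors = defaultdict(int)
    lengths = defaultdict(int)
    for domain, ref, hyp in samples:
        errors[domain] += edit_distance(ref, hyp)
        lengths[domain] += len(ref)
    per_domain = {d: errors[d] / lengths[d] for d in errors if lengths[d] > 0}
    if not per_domain:
        raise ValueError("no non-empty references to score")
    # Unweighted mean: every domain counts equally, regardless of size.
    return sum(per_domain.values()) / len(per_domain), per_domain


# Hypothetical usage with placeholder domains and Latin stand-in strings.
score, by_domain = macro_cer([
    ("domain_a", "abcd", "abcf"),
    ("domain_b", "xy", "xy"),
])
print(score, by_domain)
```

Because each domain contributes equally to the mean, a model that does well on one large domain but poorly on small ones is penalized more than it would be under a pooled (micro-averaged) error rate, which matches the abstract's emphasis on consistency across domains.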
Track: Track 1: ML on Islamic Content / ML for Muslim Communities
Submission Number: 38