The Tarteel Dataset: Crowd-Sourced and Labeled Quranic Recitation

07 Jun 2021 (modified: 24 May 2023). Submitted to NeurIPS 2021 Datasets and Benchmarks Track (Round 1). Readers: Everyone
Keywords: natural language processing, speech recognition, recitation, Arabic, Quran, crowd-sourced, diverse, dataset, Islam
TL;DR: We propose a schema for paired Quranic audio and text datasets and describe the collection, labeling, and usage of the Tarteel Quranic audio dataset.
Abstract: We propose a standard schema for paired Quranic audio and text datasets. We describe the collection, labeling, and validation of the Tarteel recitation dataset, the first large-scale dataset of Quranic recitation and accompanying Arabic text collected in a crowd-sourced manner. The dataset contains 25,000 audio clips totalling 67.39 hours of audio and represents a wide variety of recitation styles, proficiencies, and speeds. The data were collected over a period of six months from over 1,200 unique individuals of different ages, genders, and ethnicities. We describe the composition of the data and its contributors, detail how the data were collected and processed, and report baseline performance for preliminary machine learning algorithms trained and evaluated on the dataset.
