The Tarteel Dataset: Crowd-Sourced and Labeled Quranic Recitation

Hamzah I Khan; Abubakar Abid; Mohamed Medhat Moussa; Anas Abou-Allaban

07 Jun 2021 (modified: 24 May 2023). Submitted to NeurIPS 2021 Datasets and Benchmarks Track (Round 1). Readers: Everyone

Keywords: natural language processing, speech recognition, recitation, Arabic, Quran, crowd-sourced, diverse, dataset, Islam

TL;DR: We propose a schema for paired Quranic audio and text datasets and describe the collection, labeling, and usage of the Tarteel Quranic audio dataset.

Abstract: We propose a standard schema for paired Quranic audio and text datasets. We describe the collection, labeling, and validation of the Tarteel recitation dataset, the first large-scale dataset of Quranic recitation and accompanying Arabic text collected in a crowd-sourced manner. The dataset contains 25,000 audio clips totalling 67.39 hours of audio and represents a wide variety of recitation styles, proficiencies, and speeds. The data were collected over a period of six months from over 1,200 unique individuals of different ages, genders, and ethnicities. We describe the composition of the data and contributors, detail how the data were collected and processed, and report baseline performance for preliminary machine learning algorithms trained and evaluated on the dataset.

URL: https://www.tarteel.ai/dataset
