AraPunc: Arabic Punctuation Restoration Using Transformers

Published: 2023, Last Modified: 15 Jul 2025, AICCSA 2023, CC BY-SA 4.0
Abstract: Adding punctuation to Arabic text enhances readability and clarity, which matters in applications such as automatic speech recognition (ASR) and machine translation (MT) systems. In this paper, we introduce a new punctuation dataset, AraPunc, built by pre-processing the Tashkeela Arabic diacritization corpus. We keep six classes: space ‘0’, full stop ‘.’, comma ‘,’, colon ‘:’, semicolon ‘;’, and question mark ‘?’. We treat punctuation restoration as a token-wise classification problem that assigns one of the six classes to each word in the input sentence. We train different transformer-based language models on our new dataset and find that XLM-RoBERTa outperforms the other transformer-based models, with a macro-average F1-score of 0.7851 on the AraPunc test set. We also perform cross-fine-tuning between the QCRI Aljazeera Speech Recognition (QASR) dataset and our AraPunc dataset, achieving a macro-average F1-score of 0.7050 on the QASR test set after first training the model on AraPunc. Our experiments indicate that AraPunc provides better representations, which makes it more suitable for fine-tuning models for the punctuation restoration task. We release our dataset and code to facilitate future research on this topic.
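To make the token-wise formulation concrete, the following is a minimal sketch of how punctuated text could be turned into per-word labels over the six classes, and how predicted labels could be re-attached to unpunctuated words. It is not the authors' pre-processing code: the function names `to_word_labels` and `restore` are hypothetical, and folding Arabic punctuation marks (،, ؛, ؟) into the Latin class symbols is an assumption made for illustration.

```python
# Six classes named in the abstract; '0' means no punctuation after the word.
LABELS = ["0", ".", ",", ":", ";", "?"]

# Assumption: Arabic punctuation variants are folded into the Latin class symbols.
NORMALIZE = {"،": ",", "؛": ";", "؟": "?"}


def to_word_labels(text):
    """Split punctuated text into (word, label) pairs, one label per word."""
    pairs = []
    for token in text.split():
        label = "0"
        # Strip trailing punctuation; the mark closest to the word becomes the label.
        while token and NORMALIZE.get(token[-1], token[-1]) in LABELS[1:]:
            label = NORMALIZE.get(token[-1], token[-1])
            token = token[:-1]
        if token:
            pairs.append((token, label))
        elif pairs:
            # A stand-alone punctuation token is attached to the previous word.
            pairs[-1] = (pairs[-1][0], label)
    return pairs


def restore(words, labels):
    """Inverse step: re-attach predicted labels to unpunctuated words."""
    return " ".join(w if l == "0" else w + l for w, l in zip(words, labels))
```

In a setup like the one described, the label column would serve as the classification target for a token-classification model, and `restore` would be applied to the model's per-word predictions at inference time.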