Advancing Arabic Diacritization: Improved Datasets, Benchmarking, and State-of-the-Art Models

ACL ARR 2025 May Submission 6116 Authors

20 May 2025 (modified: 03 Jul 2025)
License: CC BY 4.0
Abstract: Arabic diacritics, comparable to short vowels in English, convey phonetic and grammatical information but are typically omitted in written Arabic, leading to ambiguity. Diacritization (also known as diacritic restoration or vowelization) is therefore essential for many natural language processing tasks. This paper advances Arabic diacritization through the following contributions: first, we propose a methodology for analyzing and refining a large diacritized corpus to improve training quality. Second, we introduce WikiNews-2024, a multi-reference evaluation methodology together with an updated version of the standard benchmark ``WikiNews-2014''. Third, we propose a BiLSTM-based model that achieves state-of-the-art results, with 3.12\% and 2.70\% WER on WikiNews-2014 and WikiNews-2024, respectively. Moreover, we develop a model that preserves user-specified diacritics while maintaining overall accuracy. Lastly, we demonstrate that augmenting training data improves performance in low-resource settings.
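The abstract reports results as WER (word error rate). As a rough illustration only — not the authors' evaluation code, whose exact scoring rules (e.g. handling of case endings or punctuation) are not given here — diacritization WER can be sketched as the fraction of words whose fully diacritized form differs from the reference:

```python
def diacritic_wer(reference: str, prediction: str) -> float:
    """Fraction of words whose diacritized form differs from the reference.

    Hypothetical simplification: assumes the prediction aligns with the
    reference word-for-word, which holds when only diacritics are restored.
    """
    ref_words = reference.split()
    pred_words = prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction must align word-for-word")
    if not ref_words:
        return 0.0
    errors = sum(r != p for r, p in zip(ref_words, pred_words))
    return errors / len(ref_words)

# Example: one of two words carries a wrong case-ending diacritic -> WER 0.5
print(diacritic_wer("كَتَبَ الوَلَدُ", "كَتَبَ الوَلَدَ"))
```

Benchmark scorers often also report DER (diacritic error rate) at the character level; the word-level count above is the coarser of the two.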
Paper Type: Long
Research Area: Syntax: Tagging, Chunking and Parsing
Research Area Keywords: morphologically-rich languages; POS tagging; parsing and related tasks
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Arabic
Submission Number: 6116