Abstract: The widespread absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP). This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild," to unveil patterns and latent information across six diverse genres: news articles, novels, children's books, poetry, political documents, and ChatGPT outputs. We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context. Additionally, we propose extensions to the analyze-and-disambiguate approach in Arabic NLP to leverage these diacritics, resulting in notable improvements. Our contributions encompass a thorough analysis, a valuable dataset, and an extended diacritization algorithm. We release our code and datasets as open source.
Paper Type: long
Research Area: Phonology, Morphology and Word Segmentation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: Arabic