Automated Grammar Error Correction for Urdu using Deep Learning

ACL ARR 2024 June Submission 5303 Authors

16 Jun 2024 (modified: 02 Aug 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Automated Grammar Error Correction (GEC) is an active area of research within Natural Language Processing (NLP), yet its scope remains largely restricted to English and other resource-rich languages. Urdu is widely spoken in South Asia, but because annotated datasets are lacking, no work has been done on GEC for Urdu. This paper presents a GEC model for Urdu. We also present a dataset of 1,200 pairs of grammatically correct and incorrect Urdu sentences, manually curated from children's books. In addition, we scraped 400 children's stories from Rekhta, an Urdu literary website, and introduced errors probabilistically to create a dataset of 36,000 pairs of grammatically correct and incorrect sentences. The model we used is mT5, the multilingual version of Google's transformer-based T5 model. We trained the model in two stages: first on the manually curated dataset, and then on the dataset scraped from the web. Finally, we tested the model on a Wikipedia edit history dataset containing only grammatical errors, which were identified using ERRANT. F0.5 score, GLEU, recall, and precision were used as evaluation criteria. The F0.5 scores on the test dataset after fine-tuning the mT5 Base model on the Raw + Synthetic dataset are: NOUN INFL 0.63, ADP INFL 0.76, VERB INFL 0.73, VERB FORM 0.66, ADJ INFL 0.76, and PRON INFL 0.74. To the best of our knowledge, this study is the first to focus on GEC systems for Urdu, as no prior work has been done in this field.
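For readers unfamiliar with the F0.5 metric named in the abstract, the following is a minimal sketch of how it combines precision and recall, weighting precision twice as heavily as recall. The function name `f_beta` and the edit counts in the usage lines are hypothetical illustrations and are not taken from the paper or its evaluation scripts.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Compute the F-beta score from edit-level counts.

    tp: proposed edits that match a reference edit (hypothetical counts)
    fp: proposed edits that match no reference edit
    fn: reference edits the system failed to propose
    beta < 1 favours precision; beta = 0.5 gives the F0.5 score.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only: 6 correct edits, 2 spurious edits, 4 missed edits.
score = f_beta(tp=6, fp=2, fn=4)
print(round(score, 3))
```

With these illustrative counts, precision is 0.75 and recall is 0.6, so the score sits closer to the precision value than a balanced F1 would.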
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: educational applications, GEC
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: Urdu
Submission Number: 5303