Keywords: NLP, GEC, Corpora Generation
Abstract: Grammatical Error Correction (GEC) for Urdu remains under-researched because annotated datasets are lacking. This paper addresses the challenge of generating a robust corpus for fine-tuning deep learning models for Urdu GEC. We propose a method for synthesizing a large dataset: we collect errors from the Urdu WikiEdits history, learn error patterns from them, and insert similar errors into grammatically correct sentences, producing pairs of grammatically correct and incorrect sentences. We also create a Gold Dataset by extracting errors from students' exam copies. Finally, we fine-tune various models on the synthetic data and evaluate them on the gold data to assess the quality of the synthetic data generation.
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: GEC, Urdu, NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: Urdu
Submission Number: 10879
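The error-injection idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`learn_error_patterns`, `inject_errors`, `build_pairs`), the token-level substitution model, the `rate` parameter, and the English placeholder sentences are all assumptions made for illustration. It assumes edit-history pairs are already extracted as (erroneous, corrected) sentences with equal token counts.

```python
import random
from collections import Counter


def learn_error_patterns(edit_pairs):
    """Count (correct_token, erroneous_token) substitutions seen in edits.

    Assumption: each pair is (erroneous_sentence, corrected_sentence) and
    only pairs with the same number of whitespace tokens are compared.
    """
    patterns = Counter()
    for erroneous, corrected in edit_pairs:
        err_tokens, cor_tokens = erroneous.split(), corrected.split()
        if len(err_tokens) != len(cor_tokens):
            continue  # skip edits that are not simple substitutions
        for err, cor in zip(err_tokens, cor_tokens):
            if err != cor:
                patterns[(cor, err)] += 1
    return patterns


def inject_errors(sentence, patterns, rng, rate=0.5):
    """Swap tokens for a learned erroneous form, sampled by frequency."""
    by_correct = {}
    for (cor, err), count in patterns.items():
        by_correct.setdefault(cor, []).append((err, count))

    noisy = []
    for token in sentence.split():
        candidates = by_correct.get(token)
        if candidates and rng.random() < rate:
            forms, weights = zip(*candidates)
            noisy.append(rng.choices(forms, weights=weights, k=1)[0])
        else:
            noisy.append(token)
    return " ".join(noisy)


def build_pairs(clean_sentences, patterns, seed=0, rate=0.5):
    """Return (incorrect, correct) pairs, dropping unchanged sentences."""
    rng = random.Random(seed)
    pairs = []
    for sentence in clean_sentences:
        noisy = inject_errors(sentence, patterns, rng, rate)
        if noisy != sentence:
            pairs.append((noisy, sentence))
    return pairs


if __name__ == "__main__":
    # Placeholder English data, standing in for Urdu edit-history pairs.
    edits = [("he go home", "he goes home")]
    learned = learn_error_patterns(edits)
    print(build_pairs(["she goes out", "we left"], learned, rate=1.0))
```

A real pipeline would need proper alignment for insertions and deletions and Urdu-aware tokenization; the sketch only covers one-to-one token substitutions.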