A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Abstract: Modern large language models demonstrate impressive capabilities in text generation and generalization. However, they often struggle with text editing tasks, particularly with correcting spelling errors and mistypings. In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian and potentially extensible to any language with minor changes. Our research focuses on exploring natural spelling errors and mistypings in texts and on studying how those errors can be emulated in correct sentences to effectively enrich the pre-training procedure of generative models.
We investigate the impact of such emulations and the models' abilities across different text domains. In this work, we propose two spelling corruption techniques: 1) the first mimics human behavior when making a mistake by leveraging error statistics from a particular dataset, and 2) the second injects the most common spelling errors, keyboard misclicks, and some heuristics into the texts.
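To make the second, heuristic strategy concrete, the sketch below corrupts a correct sentence with keyboard misclicks, deletions, accidental double presses, and character swaps. The function name, the QWERTY-neighbor table, and the error probabilities are our own illustrative assumptions, not the paper's implementation.

```python
import random

# Neighboring keys on a QWERTY layout (illustrative subset, not exhaustive).
QWERTY_NEIGHBORS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "i": "ujko", "t": "rfgy", "n": "bhjm",
}

def corrupt(sentence: str, p_error: float = 0.1, seed: int | None = None) -> str:
    """Inject keyboard misclicks, deletions, double presses, and swaps
    into a correct sentence with per-character probability `p_error`."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        if rng.random() >= p_error:
            out.append(ch)
            continue
        op = rng.choice(["misclick", "delete", "double_press", "swap"])
        if op == "misclick" and ch.lower() in QWERTY_NEIGHBORS:
            out.append(rng.choice(QWERTY_NEIGHBORS[ch.lower()]))
        elif op == "delete":
            pass  # drop the character entirely
        elif op == "double_press":
            out.extend([ch, ch])  # accidental repeated keystroke
        elif op == "swap" and out:
            out[-1], ch = ch, out[-1]  # transpose with the previous character
            out.append(ch)
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("spelling correction is hard", p_error=0.15, seed=42))
```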
We conducted experiments employing various corruption strategies, model architectures, and sizes at the pre-training and fine-tuning stages and evaluated the models on single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation), a library for automatic generative SC that includes a family of pre-trained generative models and built-in augmentation algorithms.
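The first, statistic-based strategy can be sketched as sampling error types and positions according to frequencies observed in an annotated parallel corpus of (erroneous, corrected) sentence pairs. The statistics table, the Gaussian error-count assumption, and the function below are hypothetical placeholders for what a real dataset would yield, not SAGE's actual API.

```python
import random

# Hypothetical error statistics, as if estimated from a parallel corpus:
# relative frequencies of error types and the mean errors per sentence.
ERROR_TYPE_FREQS = {
    "substitution": 0.45,
    "deletion": 0.25,
    "insertion": 0.20,
    "transposition": 0.10,
}
MEAN_ERRORS_PER_SENTENCE = 1.8

def emulate_errors(sentence: str, seed: int | None = None) -> str:
    """Corrupt a correct sentence by sampling error types and positions
    from the (hypothetical) dataset statistics above."""
    rng = random.Random(seed)
    chars = list(sentence)
    # Draw a per-sentence error budget around the corpus mean
    # (a real implementation would sample the empirical distribution).
    n_errors = max(1, round(rng.gauss(MEAN_ERRORS_PER_SENTENCE, 1.0)))
    types, weights = zip(*ERROR_TYPE_FREQS.items())
    for _ in range(n_errors):
        if len(chars) < 2:
            break
        pos = rng.randrange(len(chars) - 1)
        kind = rng.choices(types, weights=weights)[0]
        if kind == "substitution":
            chars[pos] = rng.choice("abcdefghijklmnopqrstuvwxyz")
        elif kind == "deletion":
            del chars[pos]
        elif kind == "insertion":
            chars.insert(pos, chars[pos])
        elif kind == "transposition":
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)

print(emulate_errors("the quick brown fox jumps over the lazy dog", seed=7))
```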
Paper Type: long
Research Area: Generation
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Russian, English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.