RoundTripOCR: A Data Generation Technique for Enhancing OCR Error Correction in Low-Resource Devanagari Languages
Abstract: Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Like machine translation systems, OCR systems are prone to errors stemming from factors such as poor image quality, diverse fonts, and language variations. In this work, we address the challenges of data generation and post-OCR error correction, specifically for low-resource languages. We propose RoundTripOCR, a novel synthetic data generation approach for Devanagari languages that tackles the scarcity of OCR error correction datasets. We release a post-OCR text correction dataset for Hindi, Marathi, Bodo, Nepali, Konkani, and Sanskrit. We also present a novel approach to OCR error correction that borrows techniques from machine translation: OCR errors are treated as mistranslations in a parallel text corpus, and the erroneous OCR output is translated into its corrected form. We employ a state-of-the-art pre-trained transformer model, mBART, to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.
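The parallel-corpus framing in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it neither generates OCR output nor trains mBART. It only shows how aligned (erroneous, correct) lines might be stored as source/target pairs and scored with a character error rate. The names `CorrectionPair`, `build_pairs`, `edit_distance`, and `char_error_rate` are hypothetical, and the single Hindi example line is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CorrectionPair:
    """One parallel example: OCR output as source, clean text as target."""
    source: str  # erroneous OCR output (treated like a "mistranslation")
    target: str  # reference text the model should produce


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalised by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


def build_pairs(ocr_lines, reference_lines):
    """Zip aligned OCR and reference lines into translation-style pairs."""
    if len(ocr_lines) != len(reference_lines):
        raise ValueError("OCR and reference lines must be aligned one-to-one")
    return [CorrectionPair(o.strip(), r.strip())
            for o, r in zip(ocr_lines, reference_lines)]


# Illustrative data only: the OCR line confuses one vowel sign.
reference_lines = ["भारत एक देश है"]
ocr_lines = ["भारत एक देश हे"]

pairs = build_pairs(ocr_lines, reference_lines)
for pair in pairs:
    print(pair.source, "->", pair.target,
          "CER:", round(char_error_rate(pair.source, pair.target), 3))
```

Pairs in this shape could then be fed to any sequence-to-sequence model as source and target text; the character error rate before and after correction gives a simple way to compare outputs.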
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Generation
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Bodo, Nepali, Konkani, Marathi, Hindi, Sanskrit
Submission Number: 5197