POCO: Low Cost Post-OCR Correction Dataset Construction via Character-Level Probabilistic Simulation
Abstract: Rapid advancements in Large Language Models (LLMs) have significantly enhanced multimodal AI, emphasizing the importance of effectively processing image-based text through Optical Character Recognition (OCR). However, the extensive computational resources required by high-performing LLMs limit their commercial applicability, making smaller OCR models prevalent despite their susceptibility to recognition errors, particularly with low-quality images. Addressing OCR errors through post-correction is essential but constrained by the scarcity of high-quality training datasets. Existing synthetic methods are either resource-intensive or insufficiently realistic. To overcome these limitations, we propose POCO (Post-OCR Correction with Output distributions), a novel low-cost framework leveraging character-level probabilistic simulations to synthetically generate realistic OCR error datasets. Our framework employs a lightweight vision model (ResNet34) to predict character probabilities and inject OCR-like errors into text corpora, effectively creating extensive and high-quality training datasets. Experimental results demonstrate significant improvements in OCR post-correction, validating our framework's practicality and effectiveness.
Paper Type: Short
Research Area: Special Theme (conference specific)
Research Area Keywords: OCR Post-Correction, Dataset Construction, Low-Resource
Contribution Types: Reproduction study, Data resources
Languages Studied: English, Chinese, Korean
Submission Number: 6980
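Illustrative sketch: the error-injection step described in the abstract can be pictured as sampling each character of a clean corpus from a per-character output distribution, which yields (noisy, clean) training pairs. The code below is a minimal sketch under stated assumptions, not the authors' implementation. The table `CHAR_PROBS` and the helpers `inject_errors` and `build_pairs` are hypothetical names, and the probabilities are made up; in the framework they would presumably be derived from the ResNet34 predictions on character images.

```python
import random

# Hypothetical per-character output distributions (assumption: in the
# framework these would come from a vision model's predictions; the
# values here are invented purely for illustration).
CHAR_PROBS = {
    "l": {"l": 0.80, "1": 0.15, "I": 0.05},
    "O": {"O": 0.85, "0": 0.15},
    "m": {"m": 0.90, "n": 0.10},
}


def inject_errors(text, char_probs, rng):
    """Resample each character from its distribution.

    Characters without an entry in char_probs are copied unchanged.
    """
    out = []
    for ch in text:
        dist = char_probs.get(ch)
        if dist is None:
            out.append(ch)
            continue
        candidates = list(dist.keys())
        weights = list(dist.values())
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return "".join(out)


def build_pairs(corpus, char_probs, seed=0):
    """Return (noisy, clean) pairs for a list of clean strings."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [(inject_errors(line, char_probs, rng), line) for line in corpus]


pairs = build_pairs(["hello world", "OCR model"], CHAR_PROBS)
```

Each pair keeps the clean string as the correction target, so the result can be fed directly to a sequence-to-sequence post-correction model. This sketch only models one-to-one substitutions; insertions, deletions, and multi-character confusions would need a richer distribution.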