POCO: Low Cost Post-OCR Correction Dataset Construction via Character-Level Probabilistic Simulation

ACL ARR 2025 May Submission 6980 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: Rapid advancements in Large Language Models (LLMs) have significantly enhanced multimodal AI, emphasizing the importance of effectively processing image-based text through Optical Character Recognition (OCR). However, the extensive computational resources required by high-performing LLMs limit their commercial applicability, making smaller OCR models prevalent despite their susceptibility to recognition errors, particularly with low-quality images. Addressing OCR errors through post-correction is essential but constrained by the scarcity of high-quality training datasets. Existing synthetic methods are either resource-intensive or insufficiently realistic. To overcome these limitations, we propose POCO (Post-OCR Correction with Output distributions), a novel low-cost framework leveraging character-level probabilistic simulations to synthetically generate realistic OCR error datasets. Our framework employs a lightweight vision model (ResNet34) to predict character probabilities and inject OCR-like errors into text corpora, effectively creating extensive and high-quality training datasets. Experimental results demonstrate significant improvements in OCR post-correction, validating our framework's practicality and effectiveness.
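The error-injection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `CONFUSION` table is a hand-written, hypothetical stand-in for the per-character output distributions that the paper obtains from a ResNet34 model, and the names `inject_ocr_errors` and `build_pairs` are assumptions made for illustration. Each clean character is replaced by a character sampled from its distribution, yielding (noisy, clean) pairs for training a post-correction model.

```python
import random

# Hypothetical per-character output distributions. In the paper these would
# come from a lightweight vision model; the values here are invented.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.06, "I": 0.04},
    "O": {"O": 0.92, "0": 0.08},
    "e": {"e": 0.95, "c": 0.05},
}


def inject_ocr_errors(text, confusion, rng):
    """Sample each character from its distribution; unknown characters pass through."""
    out = []
    for ch in text:
        dist = confusion.get(ch)
        if dist is None:
            out.append(ch)
            continue
        candidates = list(dist.keys())
        weights = list(dist.values())
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return "".join(out)


def build_pairs(corpus, confusion, seed=0):
    """Return (noisy, clean) sentence pairs for post-correction training."""
    rng = random.Random(seed)
    return [(inject_ocr_errors(s, confusion, rng), s) for s in corpus]


if __name__ == "__main__":
    for noisy, clean in build_pairs(["Hello Old World"], CONFUSION, seed=1):
        print(repr(noisy), "<-", repr(clean))
```

Because sampling is driven by a seeded `random.Random`, the same corpus and seed reproduce the same noisy text, which makes a synthetic dataset built this way repeatable.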

Paper Type: Short

Research Area: Special Theme (conference specific)

Research Area Keywords: OCR Post-Correction, Dataset Construction, Low-Resource

Contribution Types: Reproduction study, Data resources

Languages Studied: English, Chinese, Korean

Submission Number: 6980
