POCO: Low Cost Post-OCR Correction Dataset Construction via Character-Level Probabilistic Simulation

ACL ARR 2025 July Submission596 Authors

28 Jul 2025 (modified: 07 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Rapid advancements in Large Language Models (LLMs) have significantly improved multimodal AI, highlighting the importance of accurately processing image-based text through Optical Character Recognition (OCR). However, the high computational cost of powerful LLMs limits their commercial use, leading to the widespread adoption of smaller OCR models that are more error-prone, especially with low-quality images. Post-correction is essential for improving OCR outputs but is hindered by the lack of high-quality training data. Existing synthetic approaches are often computationally expensive or fail to produce realistic errors. To address this, we propose POCO (Post-OCR Correction with Output distributions), a simple and low-cost framework that uses character-level probabilistic simulations to generate realistic OCR error datasets. POCO employs a lightweight vision model (ResNet34) to predict character probabilities and inject OCR-like errors into text corpora, enabling the creation of high-quality training data at scale. Experiments show that POCO significantly improves OCR post-correction performance, demonstrating its practicality and effectiveness. The code will be made publicly available on GitHub.
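The core error-injection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a precomputed per-character confusion distribution (in POCO such distributions come from a lightweight vision model's output probabilities, e.g. ResNet34 over rendered character images); the confusion table and function names here are hypothetical.

```python
import random

# Hypothetical per-character confusion distributions. In POCO these would be
# derived from a vision model's softmax outputs over character images; the
# candidates and probabilities below are illustrative only.
CONFUSIONS = {
    "0": [("O", 0.6), ("D", 0.4)],
    "1": [("l", 0.7), ("I", 0.3)],
    "m": [("rn", 1.0)],  # a classic OCR merge error
}

def inject_ocr_errors(text, error_rate=0.1, rng=None):
    """With probability error_rate, replace each character by a visually
    similar one sampled from its confusion distribution."""
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        candidates = CONFUSIONS.get(ch)
        if candidates and rng.random() < error_rate:
            chars, weights = zip(*candidates)
            out.append(rng.choices(chars, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)

clean = "Room 101"
noisy = inject_ocr_errors(clean, error_rate=0.5)
```

Pairing each noisy output with its clean source yields (noisy, clean) training examples for a post-correction model at essentially no annotation cost.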
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: OCR post-correction, Synthetic data generation, Low-resource text correction, Character-level modeling
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Chinese, Korean
Previous URL: https://openreview.net/forum?id=JmdKlK6CnB
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: We respectfully request reassignment because the primary contribution of our work has shifted, and we have updated the research area to better reflect its current focus. Furthermore, the previous reviewers did not provide responses during the review phase, and no substantive feedback was received.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Limitations
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Experiments
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Experiments
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Experiments
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Experiments
B6 Statistics For Data: Yes
B6 Elaboration: Experiments
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Experiments; a single hyperparameter setting is used for a fair comparison across models
C3 Descriptive Statistics: Yes
C3 Elaboration: Experiments, single run
C4 Parameters For Packages: Yes
C4 Elaboration: Experiments
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Appendix
Author Submission Checklist: Yes
Submission Number: 596