POCO: Low Cost Post-OCR Correction Dataset Construction via Character-Level Probabilistic Simulation

ACL ARR 2025 July Submission596 Authors

28 Jul 2025 (modified: 07 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Rapid advancements in Large Language Models (LLMs) have significantly improved multimodal AI, highlighting the importance of accurately processing image-based text through Optical Character Recognition (OCR). However, the high computational cost of powerful LLMs limits their commercial use, leading to the widespread adoption of smaller OCR models that are more error-prone, especially with low-quality images. Post-correction is essential for improving OCR outputs but is hindered by the lack of high-quality training data. Existing synthetic approaches are often computationally expensive or fail to produce realistic errors. To address this, we propose POCO (Post-OCR Correction with Output distributions), a simple and low-cost framework that uses character-level probabilistic simulations to generate realistic OCR error datasets. POCO employs a lightweight vision model (ResNet34) to predict character probabilities and inject OCR-like errors into text corpora, enabling the creation of high-quality training data at scale. Experiments show that POCO significantly improves OCR post-correction performance, demonstrating its practicality and effectiveness. The code will be made publicly available on GitHub.
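The core error-injection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a precomputed per-character confusion distribution (in POCO such distributions come from a lightweight vision model's output probabilities, e.g. ResNet34 over rendered character images); the confusion table and function names here are hypothetical.

```python
import random

# Hypothetical per-character confusion distributions. In POCO these would be
# derived from a vision model's softmax outputs over character images; the
# candidates and probabilities below are illustrative only.
CONFUSIONS = {
    "0": [("O", 0.6), ("D", 0.4)],
    "1": [("l", 0.7), ("I", 0.3)],
    "m": [("rn", 1.0)],  # a classic OCR merge error
}

def inject_ocr_errors(text, error_rate=0.1, rng=None):
    """With probability error_rate, replace each character by a visually
    similar one sampled from its confusion distribution."""
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        candidates = CONFUSIONS.get(ch)
        if candidates and rng.random() < error_rate:
            chars, weights = zip(*candidates)
            out.append(rng.choices(chars, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)

clean = "Room 101"
noisy = inject_ocr_errors(clean, error_rate=0.5)
```

Pairing each noisy output with its clean source yields (noisy, clean) training examples for a post-correction model at essentially no annotation cost.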
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: OCR post-correction, Synthetic data generation, Low-resource text correction, Character-level modeling
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Chinese, Korean
Previous URL: https://openreview.net/forum?id=JmdKlK6CnB
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: We respectfully request reassignment because the primary contribution of our work has shifted, and we have updated the research area to better reflect its current focus. Furthermore, the previous reviewers did not provide responses during the review phase, and no substantive feedback was received.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Limitations
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Experiments
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Experiments
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Experiments
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Experiments
B6 Statistics For Data: Yes
B6 Elaboration: Experiments
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Experiments; a single hyperparameter setting is used for a fair comparison across models
C3 Descriptive Statistics: Yes
C3 Elaboration: Experiments, single run
C4 Parameters For Packages: Yes
C4 Elaboration: Experiments
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Appendix
Author Submission Checklist: Yes
Submission Number: 596