Exploring Dataset Size and Diversity for OCR Post-Correction with hmByT5 Models

ACL ARR 2025 February Submission 4250 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: This study explores the application of the hmByT5 model for Optical Character Recognition (OCR) post-correction, focusing on historical German job advertisements. Two versions of the model, the standard checkpoint and one further fine-tuned on the ICDAR 2019 dataset, were evaluated on subsets of the JobAds dataset. The effects of training dataset size and OCR model diversity on post-correction performance were analyzed. Results show that larger training datasets improve performance, but with diminishing returns, suggesting an optimal balance between annotation effort and model effectiveness. Training on outputs from multiple OCR systems enhances generalization when data is limited but may introduce conflicting correction patterns in larger datasets. Intermediate fine-tuning on unrelated datasets such as ICDAR reduced performance, underscoring the importance of domain alignment in pre-training.
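The abstract frames post-correction as a byte-level sequence-to-sequence task: noisy OCR output goes in, corrected text comes out. The following is a minimal inference sketch under that framing, assuming a fine-tuned hmByT5 checkpoint is available under the placeholder identifier "hmbyt5-jobads-postcorrection" (not the authors' released model); the example sentence is illustrative, not taken from the JobAds dataset.

```python
# Sketch: OCR post-correction as byte-level seq2seq inference with a ByT5-style model.
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL_NAME = "hmbyt5-jobads-postcorrection"  # placeholder checkpoint path, assumption

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# Noisy OCR line from a historical German job advertisement (illustrative input).
ocr_line = "Gesucht wird ein tüchtiger Schlosser für dauernde Beschäftigung."

inputs = tokenizer(ocr_line, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, num_beams=4)
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(corrected)
```

Because ByT5 models tokenize at the byte level, no language-specific vocabulary is needed, which suits historical German orthography and OCR noise; the same interface applies whether the checkpoint was trained on one OCR system's output or on a mixture of several.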
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: NLP in resource-constrained settings, fine-tuning, historical NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Historical German
Submission Number: 4250