Abstract: The adoption of Optical Character Recognition (OCR) tools has been central to the increased digitization of historical documents. However, the errors introduced during OCR, particularly in texts with a specialized vocabulary (SV), necessitate effective post-OCR correction methodologies. This study introduces a novel approach that leverages weak supervision and self-supervised fine-tuning to enhance post-OCR correction without the need for substantial manual annotation. By generating multi-noise-level synthetic data, produced by applying automatically extracted OCR errors to clean texts, we can train robust models tailored to post-OCR tasks. Furthermore, we propose a self-supervised fine-tuning strategy, applied specifically to long texts, that enables models to handle out-of-vocabulary words and SV effectively. Additionally, we evaluate the performance of GPT models on post-OCR tasks.
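To make the synthetic-data idea concrete, the sketch below illustrates one plausible way to inject automatically extracted OCR confusions into clean text at several noise levels, pairing each corrupted version with its clean original as training data. This is a minimal illustration, not the paper's implementation: the `ERROR_TABLE` contents, the `inject_ocr_noise` function, and the specific noise levels are all hypothetical.

```python
import random

# Hypothetical confusion table mapping clean substrings to OCR errors,
# e.g. as extracted by aligning OCR output against ground-truth text.
ERROR_TABLE = {
    "m": ["rn", "in"],
    "l": ["1", "I"],
    "e": ["c", "o"],
    "ti": ["u"],
}

def inject_ocr_noise(clean_text: str, noise_level: float, seed: int = 0) -> str:
    """Corrupt clean text with extracted OCR confusions.

    noise_level is the probability of replacing each matching substring,
    so one clean corpus yields several corrupted corpora (multi-noise-level data).
    """
    rng = random.Random(seed)
    out, i = [], 0
    while i < len(clean_text):
        replaced = False
        # Try longer source patterns first so "ti" wins over "t".
        for src in sorted(ERROR_TABLE, key=len, reverse=True):
            if clean_text.startswith(src, i) and rng.random() < noise_level:
                out.append(rng.choice(ERROR_TABLE[src]))
                i += len(src)
                replaced = True
                break
        if not replaced:
            out.append(clean_text[i])
            i += 1
    return "".join(out)

# Pair each noisy version with the clean text as (input, target) examples.
for level in (0.05, 0.15, 0.30):
    print(level, inject_ocr_noise("the medieval manuscript", level, seed=42))
```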