A proposal for post-OCR spelling correction using Language Models

Published: 18 Oct 2024, Last Modified: 30 Nov 2024
Venue: lxai-neurips-24
License: CC BY 4.0
Track: Full Paper
Abstract: This work explores the use of Language Models (LMs) to correct residual errors in texts extracted by OCR and HTR (Handwritten Text Recognition) systems. We propose a general approach and use images of Brazilian handwritten essays from the BRESSAY dataset as a case study. Two standard LMs (BART and ByT5) and two LLMs (LLaMA 1 and Llama 2) were evaluated in this context. The results indicate that the smaller LMs outperformed the LLMs in error rate reduction (CER and WER). Traditional correction methods, such as SymSpell and Norvig's corrector, were effective in some cases but fell short of the results obtained by the LMs. ByT5, with its byte-level tokenization, improved both CER and WER, proving effective on texts with high noise. Consequently, smaller LMs, after fine-tuning, are more efficient and cheaper for post-OCR correction. We also identify and propose promising directions for future work involving correction at broader levels of context, such as paragraphs. Code is available at https://github.com/savi8sant8s/ptbr-post-ocr-sc-llm.
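To make the described pipeline concrete, below is a minimal sketch, not the authors' exact code (see the linked repository for that), of correcting an OCR line with a ByT5 model and scoring it with CER/WER. It assumes the Hugging Face `transformers` and `jiwer` libraries and the public `google/byt5-small` checkpoint; in the paper the model would first be fine-tuned on BRESSAY line pairs, without which the base checkpoint will not produce meaningful corrections.

```python
# Minimal sketch of post-OCR correction with ByT5 plus CER/WER evaluation.
# Assumptions: google/byt5-small as the base checkpoint (fine-tuning on
# BRESSAY noisy/clean line pairs is required for real corrections), and
# jiwer for the error-rate metrics. This is illustrative, not the paper's code.
from transformers import AutoTokenizer, T5ForConditionalGeneration
import jiwer

model_name = "google/byt5-small"  # byte-level tokenizer: robust to noisy input
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# A noisy line as an HTR system might output it (hypothetical example).
ocr_line = "exemplo de linha extraida com erros de OCR"
inputs = tokenizer(ocr_line, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Compare the model output against the ground-truth transcription.
reference = "exemplo de linha extraída com erros de OCR"
print("CER:", jiwer.cer(reference, corrected))
print("WER:", jiwer.wer(reference, corrected))
```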
Submission Number: 25