Evaluating the use of large language models for post optical character recognition correction in Brazilian Portuguese
Keywords: post-OCR error correction, Brazilian Portuguese
TL;DR: Using LLMs for post-OCR correction in Brazilian Portuguese
Abstract: In recent decades, digital media have taken precedence over printed media and become firmly established in everyday life. Optical Character Recognition (OCR) technology facilitates the digitization of printed text but frequently introduces errors in the process. This study investigates the effectiveness of generative Large Language Models (LLMs), such as Gemma 3, in correcting OCR output in Brazilian Portuguese. Using the ESTER-Pt dataset, we assess the models' ability to leverage contextual information to identify and correct OCR-induced errors. The results show that LLMs can significantly outperform existing methods, reducing the character error rate (CER) from 5.12, the current state of the art for Portuguese, to 1.69.
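The abstract reports results as character error rate but does not spell out how it is computed. A common definition, assumed here and not confirmed by the source, is the Levenshtein edit distance between the OCR (or corrected) text and the reference, divided by the reference length. The function names `levenshtein` and `cer` below are illustrative only, and the paper's exact normalization and scaling may differ.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hypothesis` into `reference` (dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of `reference`
    # and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction of the reference length
    (assumed definition; multiply by 100 for a percentage)."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical OCR confusion: the letter 'l' read as the digit '1'.
    print(cer("Brasil", "Brasi1"))
```

Under this definition, a lower value means the text is closer to the reference, so a post-OCR correction step is judged by how much it lowers the CER of the raw OCR output.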
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7877