Evaluating the use of large language models for post optical character recognition correction in Brazilian Portuguese

16 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: post-OCR error correction, Brazilian Portuguese
TL;DR: Using LLMs for post-OCR correction in Brazilian Portuguese
Abstract: In recent decades, digital media have taken precedence over printed media and become firmly established in everyday life. Optical Character Recognition (OCR) technology facilitates the digitization of printed text but frequently introduces errors in the process. This study investigates the effectiveness of generative Large Language Models (LLMs), such as Gemma 3, in correcting OCR output in Brazilian Portuguese. Using the ESTER-Pt dataset, we assess the models' ability to leverage contextual information to identify and correct OCR-induced errors. The results demonstrate that LLMs can significantly outperform existing methods, reducing the character error rate (CER) from 5.12, the current state of the art for Portuguese, to 1.69.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7877
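
Note on the metric: the abstract reports results as character error rate (CER) but does not spell out how it is computed. A minimal sketch of the commonly used definition follows, assuming CER is the character-level Levenshtein distance between the OCR (or corrected) text and the ground truth, divided by the ground-truth length. The function names `levenshtein` and `cer` are illustrative and are not taken from the paper, whose exact normalization may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string as the inner dimension to save memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length.
    Assumed definition; the paper may normalize or report it differently."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative example: two diacritics lost by OCR in a 7-character word.
example = cer("coração", "coracao")  # 2 substitutions over 7 characters
```

Under this definition, the reported drop from 5.12 to 1.69 corresponds to a relative reduction of roughly 67%, since (5.12 - 1.69) / 5.12 ≈ 0.67.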