ESTER-Pt: An Evaluation Suite for TExt Recognition in Portuguese

Published: 01 Jan 2023, Last Modified: 01 Mar 2024, ICDAR (3) 2023, Readers: Everyone
Abstract: Optical Character Recognition (OCR) is a technology that enables machines to read and interpret printed or handwritten text from scanned images or photographs. However, the accuracy of OCR systems varies with several factors, such as the quality of the input image, the font used, and the language of the document. In general, OCR algorithms perform better in resource-rich languages, which have more annotated data for training the recognition process. In this work, we propose ESTER-Pt, an Evaluation Suite for TExt Recognition in Portuguese. Although Portuguese is one of the most widely spoken languages, OCR for Portuguese remains largely unexplored. Our evaluation suite comprises four types of resources: synthetic text-based documents, synthetic image-based documents, real scanned documents, and a hybrid set of real image-based documents that were synthetically degraded. Additionally, we provide results of OCR engines and post-OCR correction tools on ESTER-Pt, which can serve as baselines for future work.