OCR Improvements for Images of Multi-page Historical Documents

Ivan Gruber, Marek Hrúz, Pavel Ircing, Petr Neduchal, Tomás Zítka, Miroslav Hlavác, Zbynek Zajíc, Jan Svec, Martin Bulín

Published: 01 Jan 2021, Last Modified: 26 Feb 2025SPECOM 2021EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: This work presents a pipeline for processing digitally scanned documents, reading their textual content, and storing it in a dataset for the purpose of information retrieval. The pipeline is able to handle images of various quality, whether they were obtained by a digital scanner or camera. The image can contain multiple pages in any layout, but an approximate upright orientation is assumed. The pipeline uses Faster R-CNN to detect individual pages. These are then processed by a deskew algorithm to correct the orientation, and finally read by the Tesseract OCR system that has been retrained on a large set of synthetic images and a small set of annotated real-world documents. By applying the pipeline, we were able to increase the word recall to 60.56% which is an absolute gain of 19.19% from the baseline solution that uses only Tesseract OCR. A demo of the proposed pipeline can be found at https://archivkgb.zcu.cz/.