Fine-Tuning Vision Encoder-Decoder Transformers for Handwriting Text Recognition on Historical Documents

Published: 01 Jan 2023. Last Modified: 10 Nov 2023. Venue: ICDAR (4) 2023.
Abstract: Handwritten text recognition (HTR) has seen significant advances in recent years, mainly due to the incorporation of deep learning techniques. One area of HTR that has garnered particular interest is the transcription of historical documents, as a vast number of records remains unprocessed, and deterioration of the originals risks a permanent loss of information. Currently, the most widely used HTR approach is to train convolutional recurrent neural networks (CRNNs) with the connectionist temporal classification loss, often combined with n-gram language models. While transformer models have revolutionized natural language processing, they have yet to be widely adopted for HTR on historical documents. In this paper, we propose a new approach for HTR on historical documents based on fine-tuning pre-trained transformer models, specifically vision encoder–decoder models. This approach presents several challenges, including the limited availability of training data for specific HTR tasks. We explore various strategies for initializing and training transformer models and present a model that outperforms existing state-of-the-art methods on three different datasets. Specifically, our proposed model achieves a word error rate of 6.9% on the ICFHR 2014 Bentham dataset, 14.5% on the ICFHR 2016 Ratsprotokolle dataset, and 17.3% on the Saint Gall dataset.
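The fine-tuning setup described above can be sketched as follows. This is a minimal illustration only: the paper does not specify an implementation, so the use of the HuggingFace Transformers library, the tiny randomly initialized encoder/decoder configurations (chosen so the sketch runs without downloading weights), and the token ids are all assumptions. In practice one would instead start from pre-trained weights, e.g. `VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")`, and feed real line images with tokenized transcriptions.

```python
# Hedged sketch: one fine-tuning step of a vision encoder-decoder model
# for HTR, using tiny illustrative configs (not the paper's actual sizes).
import torch
from transformers import (
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
    ViTConfig,
    TrOCRConfig,
)

# Tiny ViT encoder + TrOCR-style decoder so the example runs offline.
enc_cfg = ViTConfig(hidden_size=64, num_hidden_layers=2, num_attention_heads=2,
                    intermediate_size=128, image_size=64, patch_size=16)
dec_cfg = TrOCRConfig(vocab_size=100, d_model=64, decoder_layers=2,
                      decoder_attention_heads=2, decoder_ffn_dim=128)
cfg = VisionEncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
cfg.decoder_start_token_id = 1  # illustrative; tokenizer-dependent in practice
cfg.pad_token_id = 0

model = VisionEncoderDecoderModel(config=cfg)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy batch: 2 "line images" and 2 tokenized transcriptions.
pixel_values = torch.randn(2, 3, 64, 64)
labels = torch.randint(2, 100, (2, 10))

# The model computes a cross-entropy loss over the decoder's
# autoregressive transcription of the image.
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()
optimizer.step()
```

Swapping the random configurations for a pre-trained checkpoint and wrapping this step in a data loader over labeled line images yields the standard fine-tuning recipe for such models.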