Abstract: Handwritten Text Recognition (HTR) is a challenging problem that plays an essential role in digitizing and interpreting diverse handwritten documents. While traditional approaches primarily rely on CNN-RNN (CRNN) architectures, recent Transformer-based systems have demonstrated impressive results in HTR. However, these systems often involve high-parameter configurations and depend extensively on synthetic data. Moreover, they pay little attention to efficiently exploiting the ability of Transformer modules to capture contextual relationships within the data. In this paper, we explore a lightweight integration of Transformer modules into existing CRNN frameworks, aiming to better model the contextual, sequential nature of the HTR task. We present a hybrid CNN image encoder with intermediate MobileViT blocks that combines the different components in a resource-efficient manner. Through extensive experiments and ablation studies, we refine the integration of these modules and demonstrate that our proposed model enhances HTR performance. Our results on the line-level IAM and RIMES datasets show that the proposed method achieves competitive performance with significantly fewer parameters than existing systems and without the use of synthetic training data.
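The abstract's core idea, interleaving a MobileViT-style block into a convolutional encoder so that self-attention captures global context on top of local conv features, can be sketched as follows. This is a minimal illustration assuming PyTorch; the module names, dimensions, and patch settings here are hypothetical and not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MobileViTBlock(nn.Module):
    """Local conv features followed by global self-attention over patches,
    fused back with the block input (MobileViT-style, simplified)."""
    def __init__(self, channels, dim, patch=2, depth=2, heads=4):
        super().__init__()
        self.patch = patch
        # Local representation: 3x3 conv, then 1x1 projection to the
        # transformer dimension.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, dim, 1),
        )
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Conv2d(dim, channels, 1)
        # Fuse the attended features with the original input.
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.patch
        y = self.local(x)                                   # (B, dim, H, W)
        # Unfold into non-overlapping p x p patches; attention runs across
        # patches at each intra-patch position.
        y = y.reshape(b, -1, h // p, p, w // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1)
        y = y.reshape(b * p * p, (h // p) * (w // p), -1)
        y = self.transformer(y)
        # Fold the sequence back into the feature map layout.
        y = y.reshape(b, p, p, h // p, w // p, -1)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(b, -1, h, w)
        y = self.proj(y)
        return self.fuse(torch.cat([x, y], dim=1))

# Shape-preserving, so the block drops between conv stages of a CRNN encoder.
block = MobileViTBlock(channels=64, dim=96, patch=2)
x = torch.randn(1, 64, 8, 32)   # a line-image feature map (B, C, H, W)
out = block(x)
```

Because the block preserves the feature-map shape, it can be inserted between convolutional stages of an existing CRNN encoder without changing the downstream recurrent head.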