TextViTCNN: Enhancing Natural Scene Text Recognition with Hybrid Transformer and Convolutional Networks

Elham Eli, Wenting Xu, Alimjan Aysa, Hornisa Mamat, Kurban Ubul

Published: 2024, Last Modified: 12 Jun 2025. Venue: PRCV (7) 2024. License: CC BY-SA 4.0

Abstract: This research presents TextViTCNN, a scene text recognition (STR) architecture that combines the strengths of convolutional neural networks (CNNs) and the Vision Transformer (ViT). A learning-based feature fusion layer merges the local features extracted by the CNN with the global features extracted by the ViT, which helps the model cope with the inherent complexity of STR. The model is particularly adept at handling diverse and irregular text, both in English and in a self-constructed Uyghur dataset, and the fusion of local and global features significantly improves recognition accuracy. The decoder employs a strategy that incorporates mask and substitution context learning, and integrates word length information through the training process of a pre-trained language model (PLM). With these components, TextViTCNN achieves state-of-the-art performance in our experiments.
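
The abstract does not specify the form of the learning-based fusion layer. As a rough illustration only, the sketch below shows one common way to fuse local and global features: a sigmoid gate that blends the two per dimension. The gating scheme, the function names `make_gate_params` and `gated_fusion`, and the toy vectors are all assumptions for illustration, not the authors' implementation. The gate weights are randomly initialised here, whereas in a real model they would be learned during training.

```python
import math
import random


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def make_gate_params(dim, seed=0):
    """Create hypothetical gate parameters (random here, trained in practice)."""
    rng = random.Random(seed)
    # One weight row per output dimension, covering the concatenated input.
    weights = [[rng.gauss(0.0, 0.02) for _ in range(2 * dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return weights, bias


def gated_fusion(local_vec, global_vec, weights, bias):
    """Blend a local (CNN-style) and a global (ViT-style) feature vector.

    Assumes both vectors have the same length. Each output element is a
    convex combination of the two inputs, weighted by a sigmoid gate
    computed from their concatenation.
    """
    concat = local_vec + global_vec  # list concatenation, length 2 * dim
    fused = []
    for j, (loc, glo) in enumerate(zip(local_vec, global_vec)):
        z = sum(w * x for w, x in zip(weights[j], concat)) + bias[j]
        gate = sigmoid(z)
        fused.append(gate * loc + (1.0 - gate) * glo)
    return fused


# Toy feature vectors standing in for CNN and ViT outputs at one position.
dim = 4
weights, bias = make_gate_params(dim)
local_vec = [0.5, -1.0, 2.0, 0.0]
global_vec = [1.5, 1.0, -2.0, 0.0]
fused = gated_fusion(local_vec, global_vec, weights, bias)
print(fused)
```

Because the gate lies strictly between 0 and 1, each fused value falls between the corresponding local and global values. This lets a trained gate favour local detail or global context per dimension.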