TextViTCNN: Enhancing Natural Scene Text Recognition with Hybrid Transformer and Convolutional Networks
Abstract: This paper presents TextViTCNN, an architecture for scene text recognition (STR) that combines the strengths of convolutional neural networks (CNNs) and Vision Transformers (ViT). The model handles diverse and irregular text, evaluated on English data and on a self-constructed Uyghur dataset, and significantly improves recognition accuracy by merging local (CNN) and global (ViT) features through a learning-based feature fusion layer. The decoder uses context learning based on masking and substitution, and integrates word-length information through the training of a pre-trained language model (PLM). With these components, TextViTCNN achieves state-of-the-art performance in our experiments.