A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding

Zhi Qiao, Zhilong Ji, Ye Yuan

12 May 2023OpenReview Archive Direct UploadReaders: Everyone

Abstract: Recently, vision Transformer (ViT) has attracted more and more attention, many works introduce the ViT into concrete vision tasks and achieve impressive performance. However, there are only a few works focused on the applications of the ViT for scene text recognition. This paper takes a further step and proposes a strong scene text recognizer with a fully ViT-based architecture. Specifically, we introduce multi- grained features into both the encoder and decoder. For the encoder, we adopt a two-stage ViT with different grained patches, where the first stage extracts extent visual features with 2D fine-grained patches and the second stage aims at the sequence of contextual features with 1D coarse-grained patches. The decoder integrates Connectionist Temporal Classification (CTC)-based and attention-based decoding, where the two decoding schemes introduce different grained features into the decoder and benefit from each other with a deep interaction. To improve the ex- traction of fine-grained features, we additionally explore self-supervised learning for text recognition with masked autoencoders. Furthermore, a focusing mechanism is proposed to let the model target the pixel recon- struction of the text area. Our proposed method achieves state-of-the-art or comparable accuracies on benchmarks of scene text recognition with a faster inference speed and nearly 50% reduction of parameters compared with other recent works.

0 Replies