A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding

Zhi Qiao, Zhilong Ji, Ye Yuan, Jinfeng Bai

Published: 2022, Last Modified: 16 May 2023, ICFHR 2022, Readers: Everyone

Abstract: The vision Transformer (ViT) has recently attracted growing attention, and many works have applied it to concrete vision tasks with impressive performance. However, only a few works have focused on applying the ViT to scene text recognition. This paper takes a further step and proposes a strong scene text recognizer with a fully ViT-based architecture. Specifically, we introduce multi-grained features into both the encoder and the decoder. For the encoder, we adopt a two-stage ViT with patches of different granularity: the first stage extracts extensive visual features from 2D fine-grained patches, and the second stage produces a sequence of contextual features from 1D coarse-grained patches. The decoder integrates Connectionist Temporal Classification (CTC)-based and attention-based decoding; the two decoding schemes introduce features of different granularity into the decoder and benefit from each other through deep interaction. To improve the extraction of fine-grained features, we additionally explore self-supervised learning for text recognition with masked autoencoders. Furthermore, we propose a focusing mechanism that lets the model target pixel reconstruction of the text area. Our method achieves state-of-the-art or comparable accuracy on scene text recognition benchmarks, with faster inference and nearly 50% fewer parameters than other recent works.
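To make the difference between 2D fine-grained and 1D coarse-grained patches concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the image size, the patch sizes, or what input the second stage splits into patches, so the 32 x 128 image, the 4 x 4 patches, the 4-pixel-wide strips, and the helper names `patches_2d` and `patches_1d` are all assumptions chosen for illustration. Both splits are applied to the same raw pixel grid only to show how the token counts differ.

```python
def patches_2d(image, patch_h, patch_w):
    """Split an H x W image (a list of rows) into non-overlapping
    patch_h x patch_w patches, flattened and listed in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch_h)
                for c in range(left, left + patch_w)
            ])
    return tokens


def patches_1d(image, strip_w):
    """Split an image into full-height vertical strips of width strip_w,
    giving a 1D token sequence along the reading direction."""
    return patches_2d(image, len(image), strip_w)


# Hypothetical sizes (assumptions, not values from the paper).
H, W = 32, 128
image = [[(r * W + c) % 256 for c in range(W)] for r in range(H)]

fine = patches_2d(image, 4, 4)   # 2D fine-grained: a grid of small patches
coarse = patches_1d(image, 4)    # 1D coarse-grained: one token per column strip

print(len(fine), len(fine[0]))      # 256 16
print(len(coarse), len(coarse[0]))  # 32 128
```

Under these assumed sizes, the fine-grained split yields many small tokens arranged on a 2D grid, while the coarse-grained split yields a much shorter left-to-right sequence, which is the shape a CTC-based or attention-based sequence decoder consumes.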
