Abstract: Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach that unifies vision and language pre-training within the classical encoder-decoder recognition framework. We adopt masked image modeling to pre-train the feature encoder on a large set of unlabeled real text images, which yields strong visual representations. Rather than introducing linguistic knowledge through an additional language model, we directly pre-train the sequence decoder. Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder with a proposed masked image-language modeling scheme.
Notably, the encoder is kept frozen while the sequence decoder is pre-trained. Experimental results demonstrate that the proposed method achieves superior performance on benchmark datasets of both Chinese and English text images. The code for our approach will be made available.
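The abstract describes pre-training the sequence decoder on synthesized text images while the already pre-trained encoder stays frozen. Below is a minimal, hypothetical PyTorch sketch of that phase; the module names (VisionEncoder, SequenceDecoder), mask ratios, and toy shapes are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of the decoder pre-training phase, NOT the authors' code:
# the vision encoder is frozen while the sequence decoder is trained on
# synthesized text images with a masked image-language modeling objective.
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Stand-in for the encoder pre-trained with masked image modeling."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(3 * 8 * 8, dim)  # toy patch embedding
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)

    def forward(self, patches):                 # patches: (B, N, 3*8*8)
        return self.blocks(self.proj(patches))  # (B, N, dim)

class SequenceDecoder(nn.Module):
    """Stand-in for the sequence decoder that predicts characters in parallel."""
    def __init__(self, dim=256, vocab=100, max_len=25):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_len, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, memory):                  # memory: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.head(self.decoder(q, memory))  # (B, max_len, vocab)

encoder, decoder = VisionEncoder(), SequenceDecoder()

# Freeze the encoder: only the decoder is updated in this phase.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=-100)

# One toy step on a batch of synthesized (rendered) text images.
patches = torch.randn(4, 32, 3 * 8 * 8)     # patchified rendered text images
labels = torch.randint(0, 100, (4, 25))     # character indices of the text

# Masked image-language modeling (illustrative): hide a random subset of image
# patches and supervise only a random subset of character positions.
patch_mask = torch.rand(4, 32) < 0.7        # hypothetical patch mask ratio
masked_patches = patches.masked_fill(patch_mask.unsqueeze(-1), 0.0)
char_mask = torch.rand(4, 25) < 0.3         # hypothetical character mask ratio
targets = labels.masked_fill(~char_mask, -100)  # ignore unmasked positions

with torch.no_grad():
    memory = encoder(masked_patches)        # frozen encoder features
logits = decoder(memory)
loss = criterion(logits.reshape(-1, 100), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```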
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We revised the paper title from 'Scene Text Recognition with Masked Vision-Language Pre-training' to 'MaskOCR: Scene Text Recognition with Masked Vision-Language Pre-training'.
Assigned Action Editor: ~Dumitru_Erhan1
Submission Number: 1656