Abstract: Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images makes it difficult to acquire feature representations that align well with images of real scenes, thereby limiting the performance of these methods. We note that vision-language models like CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting the potential to derive the representations of real images from text alone. Building upon this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats the text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder. We further introduce an Offline Randomized Perturbation (ORP) strategy that enriches the diversity of text embeddings by incorporating natural image embeddings extracted from the CLIP image encoder, effectively directing the decoder to acquire the potential representations of real images. In addition, we introduce a Feature Merge Unit (FMU) that guides the extracted visual embeddings to focus on the character foreground within the text image, enabling the pre-trained decoder to work more efficiently and accurately. Extensive experiments across various STR decoders and language recognition tasks underscore the broad applicability and remarkable performance of DPTR, providing a novel insight into STR pre-training. Code is available at https://github.com/Topdu/OpenOCR.
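The core pre-training idea can be illustrated with a minimal sketch. This is not the authors' implementation: the module names, tensor shapes, and the mixing weight alpha are assumptions. It shows CLIP text-encoder features standing in for visual features, with ORP perturbing them using pre-extracted CLIP image features of natural images before the decoder consumes them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, MAX_LEN, VOCAB = 512, 25, 97   # CLIP dim, max label length, charset size (assumptions)

# A generic cross-attention STR decoder, standing in for any of the decoders studied.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=1,
)
classifier = nn.Linear(D, VOCAB)
pos_query = nn.Parameter(torch.randn(1, MAX_LEN, D))   # learnable position queries

def dptr_pretrain_step(text_emb, img_emb, labels, alpha=0.1):
    """One decoder pre-training step using text alone (hypothetical signature).

    text_emb: (B, T, D) CLIP text-encoder features, used as pseudo visual embeddings
    img_emb:  (B, T, D) pre-extracted CLIP image features of natural images (ORP noise)
    labels:   (B, MAX_LEN) ground-truth character indices
    """
    memory = text_emb + alpha * img_emb                 # ORP: inject real-image statistics
    queries = pos_query.expand(labels.size(0), -1, -1)
    out = decoder(tgt=queries, memory=memory)           # decode characters from pseudo features
    logits = classifier(out)
    return F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1))
```

Because both encoders are frozen CLIP components, the image features for ORP can be extracted once offline, so pre-training itself touches no text images.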
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: 1. For the first time, we propose DPTR, a decoder pre-training method that requires no text images, providing a brand-new line of insight for STR pre-training.
2. Our proposed ORP strategy effectively improves model performance by adding background noise from natural images to the text embeddings. The FMU uses a learnable position query to attend to character features, removing redundant background information and thereby improving the efficiency and accuracy of the whole model (see the sketch after this list).
3. DPTR achieves state-of-the-art performance on Chinese, English, and mixed multi-language datasets, demonstrating its remarkable efficacy and broad universality in multi-language text recognition tasks.
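As a rough illustration of the FMU mentioned in point 2, the sketch below is a hypothetical PyTorch module, not the authors' code: the class name, shapes, and head count are assumptions. It shows how a learnable position query can cross-attend to visual features to pool character-foreground information for the decoder.

```python
import torch
import torch.nn as nn

class FeatureMergeUnit(nn.Module):
    """Hypothetical FMU: learnable queries attend over visual features."""
    def __init__(self, dim=512, max_len=25, nhead=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, max_len, dim))
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, vis_feat):                      # vis_feat: (B, N, D) encoder output
        q = self.query.expand(vis_feat.size(0), -1, -1)
        merged, _ = self.attn(q, vis_feat, vis_feat)  # pool character-foreground features
        return merged                                 # (B, max_len, D), fed to the decoder
```

Attention pooling with a fixed number of learnable queries keeps the decoder input length constant regardless of image size, which is one plausible reason such a unit can make decoding more efficient.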
Submission Number: 3696