VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among character sequences. Advanced text recognition faces three key challenges: (1) an encoder capable of representing the visual and semantic distributions; (2) a decoder that supervises the alignment between vision and semantics; and (3) consistency in the framework between pre-training and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach, named VL-Reader. The novelty of VL-Reader lies in that the interplay between vision and language is pervasive throughout the entire process, not only in the encoding stage but also in the decoding stage, which has previously been overlooked. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims to simultaneously model visual and linguistic information. Then, we design a Masked Visual-Linguistic Decoder (MVLD) to further leverage bi-modal feature interaction. The architecture of VL-Reader maintains consistency from training to inference. In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network degrades to reconstructing all characters from an image without any masked regions. VL-Reader achieves an average accuracy of 97.1% on six typical datasets, surpassing the SOTA by 1.1%, and the improvement is even more significant on challenging datasets. The results demonstrate that a vision-and-language reconstructor can serve as an effective scene text recognizer.
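To make the masked visual-linguistic reconstruction idea concrete, below is a minimal PyTorch sketch of such an objective: a fraction of image patches and character tokens are replaced by a shared learnable mask token, both streams are jointly encoded, and separate heads reconstruct pixels and characters. All module names, dimensions, mask ratios, and the single-shared-encoder layout here are illustrative assumptions for exposition, not the paper's actual VL-Reader architecture.

```python
import torch
import torch.nn as nn

class MVLRSketch(nn.Module):
    """Toy masked visual-linguistic reconstruction model (hypothetical shapes)."""

    def __init__(self, num_patches=64, max_chars=25, vocab_size=97, dim=256):
        super().__init__()
        self.patch_embed = nn.Linear(768, dim)        # flattened patch -> token
        self.char_embed = nn.Embedding(vocab_size, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_vis = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.pos_txt = nn.Parameter(torch.zeros(1, max_chars, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pixel_head = nn.Linear(dim, 768)         # reconstruct patch pixels
        self.char_head = nn.Linear(dim, vocab_size)   # reconstruct characters

    def forward(self, patches, chars, vis_mask, txt_mask):
        # patches: (B, P, 768) float; chars: (B, T) long;
        # vis_mask / txt_mask: (B, P) / (B, T) bool, True = masked out.
        v = self.patch_embed(patches)
        t = self.char_embed(chars)
        # Replace masked positions with the shared learnable mask token.
        v = torch.where(vis_mask.unsqueeze(-1), self.mask_token.expand_as(v), v)
        t = torch.where(txt_mask.unsqueeze(-1), self.mask_token.expand_as(t), t)
        # Jointly encode both modalities so each can attend to the other.
        x = self.encoder(torch.cat([v + self.pos_vis, t + self.pos_txt], dim=1))
        v_out, t_out = x[:, :patches.size(1)], x[:, patches.size(1):]
        return self.pixel_head(v_out), self.char_head(t_out)

model = MVLRSketch()
patches = torch.randn(2, 64, 768)
chars = torch.randint(0, 97, (2, 25))
vis_mask = torch.rand(2, 64) < 0.75    # mask ratio chosen arbitrarily here
txt_mask = torch.rand(2, 25) < 0.15
pix_pred, char_logits = model(patches, chars, vis_mask, txt_mask)

# Pre-training reconstructs the masked positions of both modalities;
# fine-tuning would pass all-False masks and keep only the character loss,
# keeping the framework consistent, as the abstract describes.
pix_loss = ((pix_pred - patches) ** 2)[vis_mask].mean()
char_loss = nn.functional.cross_entropy(char_logits[txt_mask], chars[txt_mask])
```

Note the design point this sketch illustrates: because a single encoder sees both (partially masked) modalities, reconstruction of a masked character can draw on visual evidence and vice versa, and dropping the masks at fine-tuning time changes the inputs but not the network itself.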
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language, [Content] Multimodal Fusion
Relevance To Conference: Our work focuses on the multimodal domain of this conference, primarily studying the two modalities of vision and language. Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among character sequences. In this work, we utilize a joint learning approach of vision and semantics for text recognition. Specifically, we introduce a novel training objective termed Masked Visual-Linguistic Reconstruction (MVLR) to simultaneously reconstruct masked visual-linguistic context. In addition, we propose a Masked Visual-Linguistic Decoder (MVLD) to conduct cross-modal interaction between the visual and linguistic modalities. We hope that our work will be valuable for inspiring further research in the field of multimodal learning.
Supplementary Material: zip
Submission Number: 3189