Abstract: Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among character sequences. Towards advanced text recognition, there are three key challenges: (1) an encoder capable of representing both visual and semantic distributions; (2) a decoder that supervises the alignment between vision and semantics; and (3) consistency of the framework between pre-training and fine-tuning.
Inspired by masked autoencoding, a successful pre-training strategy
in both vision and language, we propose an innovative scene text
recognition approach, named VL-Reader. The novelty of VL-Reader lies in the fact that the interplay between vision and language is pervasive throughout the entire process, not only in the encoding stage but also in the decoding stage, which has been previously overlooked. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims to model visual and linguistic information simultaneously. Then, we design a Masked Visual-Linguistic Decoder (MVLD) to further leverage bi-modal feature
interaction. The architecture of VL-Reader maintains consistency
from training to inference. In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network degenerates to reconstructing all characters from an image without any masked regions. VL-Reader achieves an average accuracy of 97.1% on six typical datasets, surpassing the SOTA by 1.1%. The improvement is even more pronounced on challenging datasets. These results demonstrate that a vision-language reconstructor can serve as an effective scene text recognizer.
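To make the training objective concrete, below is a minimal PyTorch-style sketch of masked visual-linguistic reconstruction in the spirit of MVLR. It is an illustrative toy, not the VL-Reader implementation: the names (MVLRSketch, pixel_head, char_head), dimensions, masking ratios, and the plain Transformer encoder standing in for the paper's encoder/decoder are all our own assumptions.

```python
import torch
import torch.nn as nn

class MVLRSketch(nn.Module):
    """Toy masked visual-linguistic reconstruction: a joint encoder reads
    partially masked patch and character embeddings, and two heads are
    trained to reconstruct the masked content of both modalities."""

    def __init__(self, patch_dim=256, vocab_size=97, d_model=256):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)      # embed image patches
        self.char_embed = nn.Embedding(vocab_size, d_model)  # embed characters
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pixel_head = nn.Linear(d_model, patch_dim)      # reconstruct masked patches
        self.char_head = nn.Linear(d_model, vocab_size)      # reconstruct masked characters

    def forward(self, patches, chars, patch_mask, char_mask):
        # patches: (B, P, patch_dim); chars: (B, T) int64
        # patch_mask / char_mask: bool, True where a token is masked out
        v = self.patch_proj(patches)
        t = self.char_embed(chars)
        v = torch.where(patch_mask.unsqueeze(-1), self.mask_token.expand_as(v), v)
        t = torch.where(char_mask.unsqueeze(-1), self.mask_token.expand_as(t), t)
        h = self.encoder(torch.cat([v, t], dim=1))           # joint vision-language encoding
        hv, ht = h[:, :v.size(1)], h[:, v.size(1):]
        return self.pixel_head(hv), self.char_head(ht)

# Pre-training: mask both modalities and reconstruct the masked tokens.
model = MVLRSketch()
patches = torch.randn(2, 64, 256)
chars = torch.randint(0, 97, (2, 25))
patch_mask = torch.rand(2, 64) < 0.75   # assumed masking ratios
char_mask = torch.rand(2, 25) < 0.50
rec, logits = model(patches, chars, patch_mask, char_mask)
loss = nn.functional.mse_loss(rec[patch_mask], patches[patch_mask]) \
     + nn.functional.cross_entropy(logits[char_mask], chars[char_mask])

# Fine-tuning/inference degenerates to the unmasked-image case: no patch is
# masked, and every character must be reconstructed from the image alone.
all_chars = torch.ones(2, 25, dtype=torch.bool)
no_patches = torch.zeros(2, 64, dtype=torch.bool)
rec, logits = model(patches, chars, no_patches, all_chars)
```

In this sketch, swapping the boolean masks is all that separates pre-training from fine-tuning, which illustrates the consistency property the abstract emphasizes.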
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language, [Content] Multimodal Fusion
Relevance To Conference: Our work focuses on the multimodal domain of this conference, primarily studying the two modalities of vision and language. Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among character sequences. In this work, we adopt a joint vision-semantics learning approach for text recognition. Specifically, we introduce a novel training objective termed Masked Visual-Linguistic Reconstruction (MVLR) to simultaneously reconstruct masked visual and linguistic context. In addition, we propose a cross-modal Masked Visual-Linguistic Decoder (MVLD) to conduct interaction between the visual and linguistic modalities. We hope that our work will be valuable in inspiring further research in the field of multimodal learning.
Supplementary Material: zip
Submission Number: 3189