Abstract: Inducing linguistic knowledge for scene text recognition (STR) is a recent trend that can provide semantics to boost performance. However, most auto-regressive STR models optimize one-step-ahead prediction (i.e., 1-gram prediction) over the character sequence, which exploits only the preceding semantic context. Most non-auto-regressive models apply linguistic knowledge to the output sequence alone to refine the results in parallel, and thus do not fully exploit the visual clues concurrently. In this paper, we propose a novel language-based STR model, called ProphetSTR. It adopts an n-stream self-attention mechanism in the decoder to predict the next n characters simultaneously, conditioned on the previous predictions at each time step. It thus utilizes both the preceding semantic information and near-future clues, encouraging the model to make more accurate predictions. If the predictions for the same character at successive time steps are inconsistent, none of them should be trusted; otherwise, they are reliable. We therefore propose a multi-modality verification module that masks the unreliable semantic features and takes visual features together with the trusted semantic ones as input, recovering the masked predictions in parallel. It learns to align the different modalities implicitly and considers both visual context and linguistic knowledge, yielding more reliable results. Furthermore, we propose a multi-scale weight-sharing encoder for multi-granularity image representation. Extensive experiments demonstrate that ProphetSTR achieves state-of-the-art performance on many benchmarks, and ablation studies confirm the effectiveness of each proposed component.
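To make the two core ideas of the abstract concrete, below is a minimal PyTorch sketch (not the authors' code) of (i) prophet-style n-gram decoding, where each decoding step predicts the next n characters, and (ii) consistency-based verification, where a character position whose overlapping predictions from successive steps disagree is marked unreliable before a parallel recovery pass. All names and hyperparameters (N_STREAMS, VOCAB, the module layout) are illustrative assumptions.

```python
# Sketch of n-stream future prediction + consistency verification for STR.
# Assumptions: N_STREAMS = 2 (predict t+1 and t+2 per step), a toy vocabulary,
# and a single cross-attention layer standing in for the full decoder.

import torch
import torch.nn as nn

N_STREAMS = 2      # number of future characters predicted per step (assumed n)
VOCAB = 97         # assumed character set size
D_MODEL = 256
MAX_LEN = 26

class ProphetDecoderSketch(nn.Module):
    """One decoder layer with a separate prediction head per future offset."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D_MODEL, 8, batch_first=True)
        # Head k predicts the character at position t + k + 1 from step t.
        self.heads = nn.ModuleList(nn.Linear(D_MODEL, VOCAB) for _ in range(N_STREAMS))

    def forward(self, dec_in, enc_feat):
        # Cross-attend decoder states to the visual features from the encoder.
        ctx, _ = self.attn(dec_in, enc_feat, enc_feat)
        # logits[k][:, t] is the prediction for character position t + k + 1.
        return [head(ctx) for head in self.heads]

def verify(logits):
    """Flag positions where overlapping stream predictions disagree.

    Stream 0 at step t and stream 1 at step t-1 both predict position t+1;
    per the abstract, a disagreement means neither prediction is trusted.
    """
    p0 = logits[0].argmax(-1)        # (B, T): 1-step-ahead predictions
    p1 = logits[1].argmax(-1)        # (B, T): 2-step-ahead predictions
    agree = p0[:, 1:] == p1[:, :-1]  # same target position, successive steps
    reliable = torch.ones_like(p0, dtype=torch.bool)
    reliable[:, 1:] = agree          # False -> masked, to be recovered later
    return p0, reliable

# Toy usage: random features stand in for the multi-scale encoder output.
dec = ProphetDecoderSketch()
enc_feat = torch.randn(1, 64, D_MODEL)     # flattened visual features
dec_in = torch.randn(1, MAX_LEN, D_MODEL)  # embedded previous predictions
preds, reliable = verify(dec(dec_in, enc_feat))
```

In the full model described by the abstract, the positions flagged as unreliable would have their semantic features masked, and the verification module would recover them in parallel from the visual features and the trusted semantic ones; the sketch stops at producing the reliability mask.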
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Scene text recognition (STR) contributes significantly to the field of multimedia by enabling the automatic extraction and understanding of textual information from visual content. This technology bridges the gap between image-based data and textual data, allowing for more efficient and intelligent processing and utilization of multimedia content.
Supplementary Material: zip
Submission Number: 4604