Free Lunch: Frame-level Contrastive Learning with Text Perceiver for Robust Scene Text Recognition in Lightweight Models
Abstract: Lightweight models play an important role in real-life applications, especially in the current mobile device era. However, due to limited network scale and low-quality images, the performance of lightweight models on Scene Text Recognition (STR) tasks still leaves much room for improvement. Recently, contrastive learning has demonstrated its power in many areas, delivering promising performance without additional computational cost. Based on these observations, we propose a new, efficient, and effective frame-level contrastive learning (FLCL) framework for lightweight STR models. The FLCL framework consists of a backbone to extract basic features, a Text Perceiver Module (TPM) to focus on text-relevant representations, and an FLCL loss to update the network. The backbone can be any feature extraction architecture. The TPM is an innovative Mamba-based structure designed to suppress features from the backbone that are irrelevant to the text content. Unlike existing word-level contrastive learning, we examine the nature of the STR task and propose a frame-level contrastive learning loss, which works well with the widely used Connectionist Temporal Classification (CTC) loss. We conduct experiments on six well-known STR benchmarks as well as a new low-quality dataset. Compared to vanilla contrastive learning and other non-parametric methods, the FLCL framework significantly outperforms them on all datasets, especially the low-quality dataset. In addition, character feature visualization demonstrates that the proposed method yields more discriminative features for visually similar characters, which further substantiates its efficacy. Code and the low-quality dataset will be available soon.
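The abstract does not give the exact formulation of the FLCL loss, but the core idea of contrasting aligned frame features can be sketched as a frame-level InfoNCE objective: given two views of the same text image, the feature of frame t in one view is pulled toward frame t in the other view and pushed away from all other frames. The function below is a minimal illustrative sketch under that assumption, not the paper's actual loss; all names and the two-view setup are hypothetical.

```python
import numpy as np

def frame_level_info_nce(view_a, view_b, temperature=0.1):
    """Illustrative frame-level InfoNCE loss (a sketch, not the paper's FLCL loss).

    view_a, view_b: (T, D) arrays of frame features from two views of the
    same image. Frame t in view_a is the positive for frame t in view_b;
    all other frames act as negatives.
    """
    # L2-normalize each frame feature so dot products are cosine similarities
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    # Pairwise similarities between all frame pairs, scaled by temperature
    logits = a @ b.T / temperature                      # shape (T, T)
    # Cross-entropy with the diagonal (matching frame index) as the target
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Because the loss operates per frame, it matches the frame-wise alignment assumption of CTC decoding, where each time step emits one label independently.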
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Experience] Art and Culture, [Content] Vision and Language
Relevance To Conference: This work significantly advances the field of multimedia and multimodal processing by addressing the challenge of robust scene text recognition in low-quality scenarios. The paper proposes a novel frame-level contrastive learning framework specifically tailored for lightweight models. This approach innovatively integrates a Text Perceiver Module and a frame-level contrastive loss function to enhance character-focused representation learning in STR tasks.
By leveraging the proposed technique, this paper improves the model's ability to discern and transcribe text accurately from images of varying quality, a critical aspect of multimodal information extraction from real-world scenes. This research contributes to the broader multimedia community by providing a universal strategy applicable to CTC-based STR models, thereby pushing the boundaries of performance and efficiency in recognizing visually degraded text within complex visual scenes. Experimental validation across multiple benchmarks and a newly introduced low-quality dataset confirms its state-of-the-art performance and practical significance in improving multimedia content understanding.
Submission Number: 2307