Abstract: Most text recognition methods are trained on large amounts of labeled data. Although text images are easy to collect, labeling them is costly, so making effective use of unlabeled data is worth studying. In this paper, we propose a MUltiple Granularity Semi-supervised (MUGS) method that uses both labeled and unlabeled data for text recognition. Inspired by the hierarchical structure of text (sentence, word, character), we apply semi-supervised learning at both the word level and the character level. At the word level, we introduce a Dynamic Aggregated Self-training (DAS) framework that generates pseudo-labels from unlabeled data. To ensure the quality and stability of pseudo-labeling, the pseudo-labels are aggregated from a dynamic model queue that is updated continuously throughout semi-supervised training. At the character level, we introduce a novel module named WTC (Word To Character) that converts sequential features into character representations. Contrastive learning is then applied to these character representations for better fine-grained visual modeling: characters from different images that share the same class are pulled together, and characters of different classes are pushed apart in the representation space. Combining supervision at different granularities allows more information to be exploited from the unlabeled data and substantially improves the effectiveness and robustness of the model. Comprehensive experiments on several public benchmarks validate our method, which achieves competitive performance with far fewer labeled data.
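The word-level idea of aggregating pseudo-labels from a dynamic model queue can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule, so the majority vote, the agreement and confidence thresholds, and all names here (`ModelQueue`, `push`, `pseudo_label`, `min_agreement`, `min_confidence`) are assumptions made for illustration, not the DAS implementation.

```python
from collections import Counter, deque
from typing import Callable, Deque, Optional, Tuple

# Assumption: a "model" is any callable mapping an image to (word, confidence).
Model = Callable[[str], Tuple[str, float]]


class ModelQueue:
    """Fixed-size FIFO queue of model snapshots (hypothetical helper)."""

    def __init__(self, max_size: int = 3) -> None:
        self.models: Deque[Model] = deque(maxlen=max_size)

    def push(self, model: Model) -> None:
        # Appending to a full deque silently drops the oldest snapshot,
        # which keeps the queue "dynamic" as training proceeds.
        self.models.append(model)

    def pseudo_label(
        self,
        image: str,
        min_agreement: float = 0.5,
        min_confidence: float = 0.8,
    ) -> Optional[str]:
        """Return a word-level pseudo-label, or None if it is unreliable."""
        if not self.models:
            return None
        predictions = [model(image) for model in self.models]

        # Assumed aggregation rule: majority vote over predicted words.
        votes = Counter(word for word, _ in predictions)
        word, count = votes.most_common(1)[0]
        if count / len(predictions) <= min_agreement:
            return None  # the snapshots disagree too much

        # Require the agreeing snapshots to be confident on average.
        confidences = [conf for w, conf in predictions if w == word]
        if sum(confidences) / len(confidences) < min_confidence:
            return None
        return word
```

Under these assumptions, an unlabeled image receives a pseudo-label only when most snapshots in the queue agree on the word and are confident about it; otherwise the image is skipped for that round.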