Abstract: Current mainstream text recognition models rely heavily on large-scale data, requiring expensive annotations to achieve high performance. Contrastive self-supervised learning methods, which minimize the distance between positive pairs, offer an effective way to alleviate this problem. Previous studies operate at the word level, taking the entire word image as model input. Characters, however, are the basic elements of words, so in this paper we implement contrastive learning from a different perspective: that of characters. Specifically, we propose a simple yet effective method, termed ChaCo, which takes characters and strokes (called character units) cropped from the word image as model input. With the commonly used random cropping approach, however, a positive pair may contain completely different characters, in which case minimizing the distance between them is unreasonable. To address this issue, we introduce a Character Unit Cropping Module (CUCM) that ensures a positive pair contains the same characters by constraining the region from which the positive sample is selected. Experiments show that the proposed method achieves much better representation quality than previous methods while requiring fewer computational resources. Under the semi-supervised setting, ChaCo achieves promising performance, improving accuracy by 13.1 points on the IAM dataset.
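To illustrate the constrained-cropping idea described above, the following is a minimal sketch, not the paper's actual CUCM implementation: an anchor crop is sampled from the word image, and the positive crop's start position is then restricted to a small neighborhood of the anchor, so the two views are guaranteed to overlap and thus share character content. All names and parameters (`anchor_and_positive`, `max_offset`) are hypothetical.

```python
import random

def anchor_and_positive(img_width, crop_width, max_offset, rng=random):
    """Illustrative sketch of constrained positive-pair cropping.

    Sample an anchor crop uniformly, then sample the positive crop's
    start position within max_offset pixels of the anchor's start,
    clamped to the image bounds. When max_offset < crop_width, the two
    crops always overlap, so the positive pair shares image content.
    """
    x_anchor = rng.randint(0, img_width - crop_width)
    lo = max(0, x_anchor - max_offset)
    hi = min(img_width - crop_width, x_anchor + max_offset)
    x_positive = rng.randint(lo, hi)
    return (x_anchor, x_anchor + crop_width), (x_positive, x_positive + crop_width)

# With unconstrained random cropping (max_offset unbounded), the two
# crops could cover disjoint characters; the constraint above rules
# that failure mode out.
anchor, positive = anchor_and_positive(img_width=128, crop_width=32, max_offset=8)
overlap = min(anchor[1], positive[1]) - max(anchor[0], positive[0])
print(overlap > 0)
```

Because the positive start can deviate by at most 8 pixels from a 32-pixel-wide anchor, the two crops always share at least 24 pixels of width, so the printed check is always `True` under these parameters.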