TWIST: Text-only Weakly Supervised Scene Text Spotting Using Pseudo Labels

Lilong Wen; Xiu Tang; Dongxiang Zhang

Published: 01 Jan 2024, Last Modified: 19 Feb 2025, Venue: ICMR 2024, License: CC BY-SA 4.0

Abstract: Scene text spotting plays a pivotal role in image understanding. However, building a robust model for this task requires substantial annotated data, and various efforts have been made to reduce the labeling burden. In this paper, we focus on the lowest-labor-cost setting, which relies solely on text-only annotations. Under this weakly supervised paradigm, existing methods face an intrinsic difficulty: no location information is available for training. To compensate, they often predict spatial information from attention maps generated by models pre-trained on tasks such as text recognition or classification. This approach, however, prevents comprehensive end-to-end training and does not ensure optimal performance. Moreover, the attention map for a single word tends to focus on its most distinguishing areas, which often yields location predictions with suboptimal boundaries. To overcome these limitations, we introduce TWIST, a method that integrates pseudo-label generation to enable end-to-end training of the spotting network, optimizing text recognition and location estimation at the same time. During training, to address incomplete attention maps and obtain pseudo-labels that cover the whole word, TWIST treats characters as elemental units: the pseudo-label for each text instance is generated, through a masked character prediction task, by aggregating the inferred locations of its constituent characters. The generated pseudo-labels, together with their corresponding textual content, are then used to further optimize the parameters of the spotting network. This integrated approach facilitates end-to-end training and achieves new state-of-the-art results on several public detection and end-to-end recognition benchmarks under text-only supervision.
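
As a rough illustration of the idea of aggregating character locations into a word-level pseudo-label, the sketch below takes the smallest axis-aligned box that encloses a set of per-character boxes. The abstract does not specify the aggregation rule or the box format, so the enclosing-box rule, the `(x_min, y_min, x_max, y_max)` layout, the function name `aggregate_char_boxes`, and the sample coordinates are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# Assumed box layout (not taken from the paper): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def aggregate_char_boxes(char_boxes: List[Box]) -> Box:
    """Return the smallest axis-aligned box enclosing all character boxes.

    This is one simple, hypothetical way to turn character-level locations
    into a word-level pseudo-label that covers the whole word.
    """
    if not char_boxes:
        raise ValueError("need at least one character box")
    x_mins, y_mins, x_maxs, y_maxs = zip(*char_boxes)
    return (min(x_mins), min(y_mins), max(x_maxs), max(y_maxs))


# Made-up inferred locations for the three characters of a word such as "CAT".
char_boxes = [
    (10.0, 20.0, 18.0, 40.0),
    (19.0, 21.0, 27.0, 41.0),
    (28.0, 19.0, 36.0, 39.0),
]
word_box = aggregate_char_boxes(char_boxes)
print(word_box)
```

Because every character contributes to the result, the word-level box spans the full extent of the word rather than only its most distinctive region, which is the motivation the abstract gives for working at the character level.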
