You Can even Annotate Text with Voice: Transcription-only-Supervised Text Spotting

ACM Multimedia 2022 (modified: 15 Nov 2022)
Abstract: End-to-end scene text spotting has recently gained great attention in the research community. The majority of existing methods rely heavily on location annotations of text instances (e.g., word-level boxes, word-level masks, and char-level boxes). We demonstrate that scene text spotting can be accomplished solely via text transcription, significantly reducing the need for costly location annotations. We propose a query-based paradigm that learns implicit location features through the interaction of text queries and image embeddings; these features are then made explicit during the text recognition stage via an attention activation map. Because training the weakly-supervised model from scratch is difficult, we address model convergence with a circular curriculum learning strategy. We further propose a coarse-to-fine cross-attention localization mechanism to locate text instances more precisely. Notably, we provide a solution for text spotting via audio annotation, which further reduces annotation time and establishes a link between the audio, text, and image modalities in scene text spotting. Using only transcription annotations as supervision on both real and synthetic data, we achieve competitive results on several popular scene text benchmarks. The proposed method offers a reasonable trade-off between model accuracy and annotation time, simplifying large-scale text spotting applications.
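The query-based idea can be illustrated with a minimal sketch, not the authors' released code: learnable text queries cross-attend to flattened image embeddings, a linear head predicts character logits supervised only by transcriptions, and the cross-attention weights double as coarse per-query location maps, in the spirit of the attention activation map described above. The class name, dimensions, vocabulary size, and single-layer PyTorch architecture are all illustrative assumptions.

```python
# Hypothetical sketch of transcription-only supervised text spotting:
# text queries attend over image features; attention weights act as
# implicit location cues even though only character labels supervise training.
import torch
import torch.nn as nn

class QueryImageCrossAttention(nn.Module):
    def __init__(self, dim=256, num_queries=25, num_heads=8, vocab_size=97):
        super().__init__()
        # Learnable text queries (assumed design, one query per text slot).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Recognition head supervised by the transcription loss only.
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, image_embeddings):
        # image_embeddings: (B, H*W, dim) flattened backbone features.
        b = image_embeddings.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, num_queries, dim)
        # average_attn_weights requires PyTorch >= 1.11.
        out, attn_map = self.cross_attn(
            q, image_embeddings, image_embeddings,
            need_weights=True, average_attn_weights=True,
        )  # attn_map: (B, num_queries, H*W)
        char_logits = self.classifier(out)  # (B, num_queries, vocab_size)
        return char_logits, attn_map        # attn_map gives coarse text locations

# Usage with a 32x32 feature map: the attention weights can be reshaped to
# (B, num_queries, 32, 32) and read as coarse localization maps.
feats = torch.randn(2, 32 * 32, 256)
model = QueryImageCrossAttention()
logits, attn = model(feats)
print(logits.shape, attn.shape)  # torch.Size([2, 25, 97]) torch.Size([2, 25, 1024])
```

In the paper's setting, such maps would presumably be refined by the coarse-to-fine cross-attention localization mechanism; here they only indicate where each query attends.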