Abstract: While Connectionist Temporal Classification (CTC) has become a prevalent sequence learning framework in Scene Text Recognition (STR), its characteristic spiky distribution outputs introduce critical alignment challenges. This inherent limitation often leads to suboptimal text recognition performance. To address this issue, current methods predominantly adopt two paradigms: architectural modifications, often involving auxiliary network components, and advanced loss function design. However, these techniques fail to fully capitalize on the fundamental properties of text. In contrast, this paper presents a novel approach, PerturbCTC, that aims to explore the intrinsic feature consistency within foreground text. First, we systematically analyze the characteristics of foreground text to uncover underlying feature invariance patterns. Building on these insights, we propose an innovative feature perturbation strategy that harnesses this property of text to enhance the model's perception. Additionally, we introduce the Distribution Alignment Module (DAM), which addresses the challenge of implicit alignment learning in the CTC framework. Our method is model-agnostic and generalizes to both English and Chinese scene text benchmarks. Remarkably, our approach achieves accuracy improvements of 1.3% on the Common benchmark and 3.9% on the Union14M benchmark, outperforming existing methods without increasing model parameters and setting new state-of-the-art results.
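The abstract does not specify how the feature perturbation or the Distribution Alignment Module are formulated. The PyTorch sketch below is only a rough illustration of the general idea it describes: train a CTC branch on original frame features and on a perturbed copy, and add a distribution-alignment term between the two outputs. The multiplicative noise, the KL-divergence alignment term, and all shapes and names are assumptions for illustration, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: T time steps, N batch size, C classes (CTC blank at index 0).
T, N, C = 32, 4, 37
features = torch.randn(T, N, 256)                       # frame-level visual features
classifier = torch.nn.Linear(256, C)                     # shared prediction head

targets = torch.randint(1, C, (N, 8))                    # dummy label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 8, dtype=torch.long)

def ctc_branch(feats):
    """Return per-frame log-probabilities and the CTC loss for one branch."""
    log_probs = classifier(feats).log_softmax(dim=-1)    # (T, N, C)
    loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
    return log_probs, loss

# Original branch.
log_p, loss_ctc = ctc_branch(features)

# Perturbed branch: simple multiplicative noise stands in for the paper's
# (unspecified in the abstract) text-aware feature perturbation.
perturbed = features * (1.0 + 0.1 * torch.randn_like(features))
log_p_pert, loss_ctc_pert = ctc_branch(perturbed)

# Alignment term: KL divergence pulling the perturbed-branch distribution
# toward the original one, a stand-in for a distribution-alignment objective.
loss_align = F.kl_div(log_p_pert, log_p.detach().exp(), reduction="batchmean")

loss = loss_ctc + loss_ctc_pert + loss_align
loss.backward()
```

In this reading, enforcing consistent per-frame predictions under perturbation is what would encourage the model to rely on stable foreground-text features rather than brittle, spiky alignments; the paper's actual losses and module design may differ.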
External IDs: dblp:conf/icdar/LiWSWZYQZ25