Scene Text Recognition Via k-NN Attention-Based Decoder and Margin-Based Softmax Loss

Published: 01 Jan 2024, Last Modified: 05 Jun 2025, PRCV (7) 2024, License: CC BY-SA 4.0
Abstract: To cope with complex backgrounds and diverse text image shapes, this paper proposes an encoder-decoder-based scene text recognition model named E2D-Rec, which enhances the recognition of irregular text and achieves stronger generalization. First, a text rectification network transforms irregular text, such as curved or skewed text, into a relatively regular form: the network iteratively learns a set of control points over the text region, and these control points drive a TPS (thin-plate spline) interpolation that produces rectified text images. Then, a modeling network based on the encoder-decoder architecture predicts text sequences in an auto-regressive manner. The visual encoder generates image patch embeddings from the text image, and the visual-textual decoder learns the correlation between word embeddings and image patch embeddings via k-NN attention selection. Finally, during training, a loss function with an inter-class penalty serves as the model's objective: by widening the boundaries between classes in the final label-space mapping layer, the model learns deep features with high discriminative power. Experimental results validate the proposed model's improved recognition performance on the Union14M-Benchmark and six commonly used datasets.
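The two components most specific to this paper, k-NN attention selection and the margin-based softmax loss, can be illustrated with a minimal PyTorch sketch. This is a hedged reconstruction from the abstract alone: the function names, the top-k interpretation of "k-NN attention selection," and the additive-margin form of the inter-class penalty are assumptions, and the paper's exact formulations may differ.

```python
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, top_k=8):
    """Scaled dot-product attention where each query attends only to its
    top_k highest-scoring keys. This is one common reading of "k-NN
    attention selection"; the paper may define the neighborhood differently."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (B, Lq, Lk)
    top_k = min(top_k, scores.size(-1))
    # Threshold at the k-th largest score per query; mask the rest out.
    kth = scores.topk(top_k, dim=-1).values[..., -1:]  # (B, Lq, 1)
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v               # (B, Lq, d_v)

def margin_softmax_loss(logits, target, margin=0.35, scale=30.0):
    """Additive-margin softmax: subtract a margin from the target-class
    logit before cross-entropy, which widens inter-class boundaries in the
    label-space mapping layer. The margin/scale values are illustrative,
    and such losses are usually applied to normalized (cosine) logits."""
    one_hot = F.one_hot(target, logits.size(-1)).to(logits.dtype)
    return F.cross_entropy(scale * (logits - margin * one_hot), target)
```

In a decoder step, `q` would come from the word embeddings of previously predicted characters and `k`, `v` from the image patch embeddings, so each character position attends only to its most relevant patches; the margin loss then replaces plain cross-entropy over the character vocabulary during training.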