Keywords: Scene Text Recognition, Non-Autoregressive
TL;DR: This paper proposes a non-autoregressive transformer for scene text recognition
Abstract: Autoregressive attention-based methods have made significant advances in scene text recognition. However, their inference speed is limited by the iterative decoding scheme. In contrast, non-autoregressive methods use a parallel decoding paradigm, making them much faster than autoregressive decoders. The dilemma is that, although they are faster, non-autoregressive methods rely on a character-wise independence assumption, causing them to perform much worse than autoregressive methods. In this paper, we propose a simple non-autoregressive transformer-based text recognizer, named NAText, with a progressive learning approach that forces the network to learn the relationships between characters. Furthermore, we redesign the query composition by introducing positional encoding of the character center, which has a clearer physical meaning than the conventional design. Experiments show that NAText better utilizes positional information for 2D feature aggregation. With all these techniques, NAText achieves performance competitive with state-of-the-art methods.
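The speed contrast the abstract describes can be illustrated with a minimal sketch. This is hypothetical code, not the authors' implementation: `step_fn` and `pos_fn` stand in for a real decoder, and the sinusoidal encoding of per-character positions is an assumed instantiation of the "positional encoding of the character center" the paper mentions.

```python
import math

def positional_encoding(pos, dim):
    """Standard sinusoidal encoding of a scalar position.
    NAText reportedly builds its decoder queries from encodings of
    character-center positions; the exact formulation is assumed here."""
    return [math.sin(pos / 10000 ** (2 * (i // 2) / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** (2 * (i // 2) / dim))
            for i in range(dim)]

def autoregressive_decode(step_fn, max_len):
    # Iterative scheme: each step consumes the previously decoded prefix,
    # so inference requires max_len sequential model calls.
    out = []
    for _ in range(max_len):
        out.append(step_fn(out))
    return out

def parallel_decode(pos_fn, max_len):
    # Non-autoregressive scheme: every position is predicted independently
    # from its positional query. In a real model this is one batched call,
    # so latency does not grow with sequence length -- but each prediction
    # ignores the other characters (the character-wise independence assumption).
    queries = [positional_encoding(p, 8) for p in range(max_len)]
    return [pos_fn(q) for q in queries]
```

The sketch makes the trade-off concrete: the autoregressive loop sees the growing prefix (modeling inter-character dependencies at the cost of sequential steps), while the parallel decoder conditions only on positional queries, which is what NAText's progressive learning approach is meant to compensate for.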
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9284