OTE: Exploring Accurate Scene Text Recognition Using One Token

Published: 2024, Last Modified: 15 Jan 2026CVPR 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: In this paper, we propose a novel framework to fully exploit the potential of a single vector for scene text recognition (STR). Different from previous sequence-to-sequence methods that rely on a sequence of visual tokens to rep-resent scene text images, we prove that just one token is enough to characterize the entire text image and achieve ac-curate text recognition. Based on this insight, we introduce a new paradigm for STR, called One Token rEcognizer (OTE). Specifically, we implement an image-to-vector en-coder to extract the fine-grained global semantics, elimi-nating the need for sequential features. Furthermore, an elegant yet potent vector-to-sequence decoder is designed to adaptively diffuse global semantics to corresponding character locations, enabling both autoregressive and non-autoregressive decoding schemes. by executing decoding within a high-level representational space, our vector-to-sequence (V2S) approach avoids the alignment issues between visual tokens and character embeddings prevalent in traditional sequence-to-sequence methods. Remarkably, due to introducing character-wise fine-grained information, such global tokens also boost the performance of scene text retrieval tasks. Extensive experiments on synthetic and real datasets demonstrate the effectiveness of our method by achieving new state-of-the-art results on various public STR benchmarks. Our code is available at h t t $P$ https://github.com/Xu-Jianjun/OTE.
Loading