Abstract: The language modeling paradigm for scene text recognition (STR) has demonstrated impressive universal capabilities across extensive STR scenarios. However, existing methods still struggle with text images of irregular shapes and diverse appearances (e.g., curved, artistic, multi-oriented) due to the absence of contextual information during initial decoding. In this work, inspired by the principle of 'forest before trees' in human visual perception, we introduce NASTR, a non-autoregressive scene text recognizer that endows its attentional decoder with global awareness. Specifically, we design a global-to-local attention procedure that simulates the mechanism in the human visual system whereby globally holistic processing of visual signals precedes locally detailed responses. This is achieved by leveraging global image information queries to condition the generation of glimpse vectors at each decoding time step. This procedure enables NASTR to achieve performance on par with its state-of-the-art autoregressive counterparts while operating in a fully parallel manner. Moreover, we propose multiple optional and flexible encoding constraint components to alleviate the representation quality degradation caused by the global image information queries on multilingual and multi-domain STR tasks. These components constrain the global image features from the perspectives of global structure, global semantics, and linguistic knowledge. Extensive experimental results demonstrate that NASTR consistently outperforms existing methods on both Chinese and English STR benchmarks. Our source code, trained models, and logs are available at https://github.com/ML-HDU/NASTR.
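To make the global-to-local attention idea concrete, the following is a minimal sketch, not the authors' implementation: a pooled "global" descriptor of the encoder features conditions a set of positional queries, and all glimpse vectors are produced in one parallel (non-autoregressive) attention pass. The module name, dimensions, pooling choice, and fusion scheme are illustrative assumptions.

```python
# Hypothetical sketch of a global-to-local, non-autoregressive decoder.
# All names and design details here are assumptions for illustration only.
import torch
import torch.nn as nn


class GlobalToLocalDecoder(nn.Module):
    def __init__(self, d_model=512, max_len=25, num_classes=97):
        super().__init__()
        self.max_len = max_len
        # Learnable positional queries, one per decoding time step.
        self.pos_queries = nn.Parameter(torch.randn(max_len, d_model))
        # Fuse the global image descriptor into each positional query.
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, enc_feats):
        # enc_feats: (B, N, d_model) flattened visual features from the encoder.
        b = enc_feats.size(0)
        # "Forest before trees": a holistic (global) descriptor of the image.
        global_feat = enc_feats.mean(dim=1, keepdim=True)            # (B, 1, d)
        pos = self.pos_queries.unsqueeze(0).expand(b, -1, -1)        # (B, T, d)
        # Condition every local (per-step) query on the global descriptor.
        queries = self.fuse(
            torch.cat([pos, global_feat.expand_as(pos)], dim=-1)
        )
        # A single parallel attention pass yields glimpses for all steps at once.
        glimpses, _ = self.attn(queries, enc_feats, enc_feats)       # (B, T, d)
        return self.classifier(glimpses)                              # (B, T, C)


# Usage (shapes only): logits for every character position in parallel.
logits = GlobalToLocalDecoder()(torch.randn(2, 64, 512))             # (2, 25, 97)
```

The key contrast with an autoregressive decoder is that no previously decoded character is fed back in; the global descriptor supplies the context that would otherwise be missing at the first decoding step.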
DOI: 10.1109/TCSVT.2025.3625758