PEAN: A Diffusion-Based Prior-Enhanced Attention Network for Scene Text Image Super-Resolution

Zuoyan Zhao; Hui Xue; Pengfei Fang; Shipeng Zhu

Published: 20 Jul 2024, Last Modified: 05 Aug 2024 · MM2024 Poster · CC BY 4.0

Abstract: Scene text image super-resolution (STISR) aims to simultaneously increase the resolution and readability of low-resolution scene text images, thus boosting the performance of the downstream recognition task. Two factors in scene text images, visual structure and semantic information, affect the recognition performance significantly. To mitigate the effects of these factors, this paper proposes a Prior-Enhanced Attention Network (PEAN). Specifically, an attention-based modulation module is leveraged to understand scene text images by perceiving the local and global dependence of images, regardless of the shape of the text. Meanwhile, a diffusion-based module is developed to enhance the text prior, hence offering better guidance for the SR network to generate SR images with higher semantic accuracy. Additionally, a multi-task learning paradigm is employed to optimize the network, enabling the model to generate legible SR images. As a result, PEAN establishes new SOTA results on the TextZoom benchmark. Experiments are also conducted to analyze the importance of the enhanced text prior as a means of improving the performance of the SR network. Code will be made available.

Primary Subject Area: [Experience] Multimedia Applications

Relevance To Conference: Scene text image super-resolution (STISR) is a research area that has received attention for many years. This work focuses on two major factors that affect the performance of the downstream recognition task, namely visual structure and semantic information, to provide a novel solution to STISR. We propose a Prior-Enhanced Attention Network (PEAN) that effectively enhances the text prior that guides the SR process, and we devise an attention-based module to handle images with texts of various lengths and shapes. Experiments show that this model is superior to previous works, and we conduct thorough ablation studies to analyze it. We believe that this model can serve as an effective and efficient alternative for STISR, thereby facilitating multimedia applications such as scene text recognition. It will also provide insights for other research areas related to multimedia processing, especially scene text image processing.

Supplementary Material: zip

Submission Number: 2034
