Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance

03 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Scene Text Image Super-Resolution, Image Super-Resolution, Diffusion Models
Abstract: Current generative super-resolution methods perform well on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce **TIGER** (**T**ext–**I**mage **G**uided sup**E**r-**R**esolution), a novel two-stage framework that breaks this trade-off through a *"text-first, image-later"* paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and then uses them to guide the subsequent full-image super-resolution. This glyph-to-image guidance ensures both high fidelity and visual consistency. To support comprehensive training and evaluation, we also contribute **UltraZoom-ST** (UltraZoom-Scene Text), the first scene text dataset with an extreme zoom factor (**$\times$14.29**). Extensive experiments show that TIGER achieves **state-of-the-art** performance, improving text readability while preserving overall image quality.
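The abstract describes only the control flow of the "text-first, image-later" idea: a glyph-restoration stage runs first, and its output conditions the full-image stage. The sketch below illustrates that ordering alone. It is a toy under stated assumptions, not TIGER's implementation. The names `two_stage_sr`, `toy_restore_glyphs`, and `toy_enhance_image` are hypothetical. Nearest-neighbour upscaling and a brightness threshold stand in for the learned (diffusion-based) models, and images are plain nested lists of grayscale values in [0, 1].

```python
from typing import Callable, List

Image = List[List[float]]   # grayscale image, row-major, values in [0, 1]
Mask = List[List[int]]      # 1 = glyph stroke, 0 = background


def upscale_nearest(img: Image, factor: int) -> Image:
    """Nearest-neighbour upscaling: repeat every pixel `factor` times per axis."""
    return [[px for px in row for _ in range(factor)] for row in img for _ in range(factor)]


def toy_restore_glyphs(lr: Image, factor: int, threshold: float = 0.5) -> Mask:
    """Hypothetical stage 1 stand-in: binarize the upscaled input so that
    dark pixels are treated as glyph strokes (a real model would predict these)."""
    hr = upscale_nearest(lr, factor)
    return [[1 if px < threshold else 0 for px in row] for row in hr]


def toy_enhance_image(lr: Image, glyphs: Mask, factor: int, ink: float = 0.0) -> Image:
    """Hypothetical stage 2 stand-in: upscale the whole image, then overwrite
    pixels covered by the restored glyph mask with a crisp ink value."""
    hr = upscale_nearest(lr, factor)
    return [[ink if g else px for px, g in zip(row, grow)] for row, grow in zip(hr, glyphs)]


def two_stage_sr(
    lr: Image,
    factor: int,
    restore_glyphs: Callable[[Image, int], Mask],
    enhance_image: Callable[[Image, Mask, int], Image],
) -> Image:
    """Text first, image later: stage 2 receives stage 1's glyphs as guidance."""
    glyphs = restore_glyphs(lr, factor)         # stage 1: restore text structure
    return enhance_image(lr, glyphs, factor)    # stage 2: glyph-guided full-image SR


lr_img = [[0.2, 0.9],
          [0.8, 0.4]]
sr_img = two_stage_sr(lr_img, 2, toy_restore_glyphs, toy_enhance_image)
for row in sr_img:
    print(row)
```

Because both stages are passed in as callables, either toy stand-in could be swapped for a learned model without changing the ordering, which is the property the paradigm relies on.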
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1329