Text-Aware Image Restoration with Diffusion Models

Jaewon Min; Jin Hyeon Kim; Paul Hyunbin Cho; Jaeeun Lee; Jihye Park; Park Min Kyu; Sangpil Kim; Hyunhee Park; Seungryong Kim

Published: 26 Jan 2026, Last Modified: 11 Apr 2026, Venue: ICLR 2026 Poster, License: CC BY 4.0

Keywords: Diffusion Model, Image Restoration, Text-spotting, Scene-Text Image Super Resolution

TL;DR: We introduce TAIR, a new task for restoring images that contain text by combining diffusion-based restoration and text spotting. We also propose a dataset, SA-Text, enabling joint optimization of visual and textual fidelity.

Abstract: While diffusion models have achieved remarkable success in natural image restoration, they often fail to faithfully recover textual regions, frequently producing plausible yet incorrect text-like patterns, a phenomenon we term text-image hallucination. To address this limitation, we propose Text-Aware Image Restoration (TAIR), a task requiring simultaneous recovery of visual content and textual fidelity. For this purpose, we introduce SA-Text, a large-scale benchmark of 100K high-quality scene images with dense annotations of diverse and complex text instances. We further present a multi-task diffusion framework, TeReDiff, which leverages internal features of diffusion models to jointly train a text-spotting module with the restoration module. This design allows intermediate text predictions from the text-spotting module to condition the diffusion-based restoration process during denoising, thereby enhancing text recovery. Extensive experiments demonstrate that our approach faithfully restores textual regions, outperforms existing diffusion-based methods, and achieves new state-of-the-art results on TextZoom, a benchmark for scene-text image super-resolution (STISR), which can be regarded as a subtask of TAIR. The code, weights, and dataset will be publicly released.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 2889
