Keywords: Super-Resolution, Image Restoration, Language, CLIP
Abstract: Single image super-resolution (SISR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs.
Training SR models typically requires paired HR–LR data, which is difficult to obtain in practice. As a result, most methods synthesize LR images by artificially degrading HR images with handcrafted kernels or camera ISP adjustments. However, these synthetic degradations fail to capture the complexity of real LR images, leading to poor generalization on real-world inputs. To address this, we observe that even within a single high-quality image, regions at different depths exhibit varying resolutions: distant regions act as LR patches and closer ones as HR patches. This allows the extraction of real, degradation-induced LR patches from real images. Since these LR patches lack paired HR counterparts, we propose LA-SR (Language Assistant for SR), a novel framework for unpaired SR. The key idea of LA-SR is to redefine unpaired SR in the \emph{language space}, using vision-language models to bridge the LR–HR gap. LA-SR projects images into a semantic-rich space representing both content and quality, and applies two language-guided losses: a linguistic-content loss to preserve semantic fidelity, and a linguistic-quality loss to enhance perceptual realism. With this alignment, LA-SR effectively super-resolves real LR inputs, producing realistic outputs that overcome the limitations of methods trained on synthetic data.
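The abstract does not give the loss formulas, but the two language-guided objectives can be illustrated with a minimal sketch in CLIP-style embedding space. All function names, the image/text embeddings, and the contrastive softmax form below are assumptions for illustration, not the paper's actual definitions; embeddings are stand-ins for outputs of a vision-language encoder such as CLIP.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def linguistic_content_loss(sr_emb, lr_emb):
    """Hypothetical content loss: the SR output's embedding should stay
    close to the LR input's embedding, preserving semantic fidelity."""
    return 1.0 - cosine(sr_emb, lr_emb)

def linguistic_quality_loss(sr_emb, hq_text_emb, lq_text_emb):
    """Hypothetical quality loss: pull the SR embedding toward a
    'high quality photo' text embedding and away from a 'low quality
    photo' one, via a two-way softmax over similarities."""
    s_hq = cosine(sr_emb, hq_text_emb)
    s_lq = cosine(sr_emb, lq_text_emb)
    exps = np.exp([s_hq, s_lq])
    p_hq = exps[0] / exps.sum()  # probability mass on "high quality"
    return -np.log(p_hq)
```

In a real training loop, `sr_emb` would come from encoding the super-resolved image with the frozen vision-language model, and the text embeddings from encoding fixed quality prompts; the two losses would then be weighted and combined with any pixel-level objectives.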
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6995