Title: ResShift: Residual Shifting Diffusion for Ultra-Efficient Image Super-Resolution

Abstract: Diffusion models have achieved remarkable success in image super-resolution (SR), yet their practical adoption is severely hampered by prohibitively slow inference speeds, often demanding hundreds to thousands of sampling steps. Current acceleration techniques typically compromise reconstruction quality, yielding perceptually inferior and over-blurry SR outputs. This paper introduces ResShift, a novel and highly efficient diffusion model specifically designed for SR that drastically reduces the required number of diffusion steps, thereby obviating the need for post-hoc acceleration and its associated performance degradation. ResShift constructs a unique Markov chain that directly bridges high-resolution (HR) and low-resolution (LR) image spaces by progressively shifting the residual information between them, leading to significantly enhanced transition efficiency. Furthermore, we develop an elaborate, flexible noise schedule that precisely controls the residual shifting speed and noise intensity throughout the diffusion process. Comprehensive experiments on both synthetic and real-world datasets demonstrate that ResShift achieves superior or at least comparable performance to leading state-of-the-art methods, notably requiring only 15 sampling steps. This represents a substantial leap in efficiency without sacrificing quality. Code and pre-trained models are publicly available at https://github.com/zsyOAOA/ResShift.

Section: Introduction
Image super-resolution (SR) is a critical and inherently ill-posed problem in computer vision, focused on reconstructing a high-resolution (HR) image from a given low-resolution (LR) counterpart. The complexity and unknown nature of real-world degradation models make this task particularly challenging. Recently, diffusion models [1,2] have emerged as powerful generative models, achieving state-of-the-art performance in diverse image generation tasks [3]. Their potential has also been recognized in various low-level vision applications, including image editing [4,5], inpainting [6,7], and colorization [8,9]. Consequently, there is significant interest in adapting diffusion models for the demanding SR task.

Existing diffusion-based SR methods generally follow two main strategies: either directly integrating the LR image as input to a standard diffusion model (e.g., DDPM [2]) and retraining it for SR [10,11], or utilizing an unconditional pre-trained diffusion model as a prior and guiding its reverse process with additional constraints to generate HR images [7,12,13,14]. A fundamental limitation of both approaches is their reliance on the conventional Markov chain structure of DDPM, which necessitates hundreds or even thousands of sampling steps for high-quality inference. While various acceleration techniques [15,16,17] have been developed to reduce sampling steps, they invariably lead to a noticeable degradation in performance, often manifesting as over-smooth and perceptually inferior SR results, as illustrated by the DDIM [16] accelerated inference in Fig. 1. This trade-off between efficiency and performance highlights a critical gap: the need for a novel diffusion model for SR that inherently achieves both fast inference and superior reconstruction quality without compromise.

This paper addresses this challenge by proposing ResShift, an efficient diffusion model specifically designed for SR. Unlike traditional diffusion models that gradually transform data into a generic Gaussian noise prior, ResShift constructs a unique Markov chain that directly and efficiently bridges the HR and LR image spaces by progressively shifting the residual information between them. This design fundamentally improves transition efficiency, enabling high-quality SR with a significantly reduced number of diffusion steps. We argue that for SR, where the LR image provides a strong initial cue, a prior distribution based on the LR image is more appropriate than a standard Gaussian. This iterative recovery from the LR counterpart, rather than from pure noise, naturally leads to faster convergence and improved inference efficiency.

Our method involves a shorter Markov chain that transitions between the HR image and its corresponding LR image. The chain's initial state approximates the HR image distribution, while its final state converges to an approximate LR image distribution. This is achieved through a carefully designed transition kernel that shifts the residual between them step by step. This residual-shifting mechanism is substantially more efficient than existing diffusion-based SR methods, as the essential information transfer occurs within dozens of steps. Furthermore, our formulation yields an analytically tractable and concise expression for the evidence lower bound, simplifying the derivation of the optimization objective for training. Building upon this novel diffusion kernel, we introduce a highly flexible noise schedule that precisely controls both the residual shifting speed and the noise strength at each step. This schedule provides fine-grained control over the fidelity-realism trade-off of the recovered results through tunable hyper-parameters.

In summary, the main contributions of this work are:
• We introduce ResShift, an efficient diffusion model for SR that performs iterative sampling from the LR image to the desired HR output by dynamically shifting the residual between them during inference. This novel approach significantly enhances efficiency.
• Extensive experiments demonstrate the advantages of ResShift in both efficiency and quality. It achieves compelling results with only 15 sampling steps, outperforming or matching current diffusion-based SR methods that require considerably longer sampling processes. A qualitative preview of our results against existing methods is presented in Fig. 1.
• We develop a flexible noise schedule for the proposed diffusion model that controls both the residual shifting speed and the noise level throughout the transition process, and thereby the balance between fidelity and realism.

Section: Methodology
In this section, we detail ResShift, a novel diffusion model specifically engineered for the image super-resolution (SR) task. For clarity, we denote the low-resolution (LR) image as $y_0$ and the high-resolution (HR) image as $x_0$. To ensure consistent processing, we assume $y_0$ and $x_0$ possess identical spatial resolutions; this can be readily achieved by pre-upsampling the LR image $y_0$ using nearest neighbor interpolation when necessary.
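The pre-upsampling step can be illustrated with a minimal NumPy sketch. The function name `upsample_nearest`, the (H, W, C) array layout, and the integer scale factor are assumptions made for illustration only.

```python
import numpy as np

def upsample_nearest(lr: np.ndarray, scale: int = 4) -> np.ndarray:
    """Nearest-neighbor upsampling of an (H, W, C) array by an integer factor."""
    # Repeat every row and every column `scale` times.
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

# Toy 2x2 RGB "LR image" brought to the 8x8 resolution of a hypothetical HR image.
lr = np.arange(2 * 2 * 3, dtype=np.float32).reshape(2, 2, 3)
y0 = upsample_nearest(lr, scale=4)
```

Each LR pixel becomes a 4 × 4 block of identical values, so $y_0$ and $x_0$ share one spatial grid without any new pixel values being introduced.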

Section: Model Design
Inspired by the remarkable success of iterative generation paradigms in diffusion models for capturing complex data distributions, we adopt a similar iterative approach for SR. Our proposed ResShift constructs a unique Markov chain that directly bridges the HR and LR image spaces, as conceptually illustrated in Fig. 2. This design allows the SR task to be accomplished through an efficient reverse sampling process, conditioned on any given LR image. We now elaborate on the construction of this specialized Markov chain for SR.
Forward Process. We define the residual between the LR and HR images as $e_0$, such that $e_0 = y_0 - x_0$. The core innovation of ResShift is to transition from $x_0$ to $y_0$ by progressively shifting this residual $e_0$ across a Markov chain of length $T$. To govern this shift, we introduce a shifting sequence $\{\eta_t\}_{t=1}^{T}$, which monotonically increases with timestep $t$, satisfying the boundary conditions $\eta_1 \to 0$ and $\eta_T \to 1$. The transition distribution is then formulated based on this sequence:
$$ q(x_t \mid x_{t-1}, y_0) = \mathcal{N}\left(x_t;\ x_{t-1} + \alpha_t e_0,\ \kappa^2 \alpha_t I\right), \quad t = 1, 2, \cdots, T, \tag{1} $$
where $\alpha_t = \eta_t - \eta_{t-1}$ for $t > 1$ and $\alpha_1 = \eta_1$. Here, $\kappa$ is a hyper-parameter that controls the noise variance, and $I$ represents the identity matrix. A notable advantage of this formulation is that the marginal distribution at any timestep $t$ is analytically tractable:
$$ q(x_t \mid x_0, y_0) = \mathcal{N}\left(x_t;\ x_0 + \eta_t e_0,\ \kappa^2 \eta_t I\right), \quad t = 1, 2, \cdots, T. \tag{2} $$
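A short sketch of why the marginal takes this form, written as a reconstruction from Eq. (1) rather than as the authors' own derivation. The standard Gaussian variables $\xi_s$ are notation introduced here for illustration.

```latex
\begin{aligned}
x_t &= x_{t-1} + \alpha_t e_0 + \kappa\sqrt{\alpha_t}\,\xi_t,
      \qquad \xi_t \sim \mathcal{N}(0, I) \ \text{i.i.d.} \\
    &= x_0 + \Big(\sum_{s=1}^{t}\alpha_s\Big) e_0
       + \kappa \sum_{s=1}^{t}\sqrt{\alpha_s}\,\xi_s \\
    &= x_0 + \eta_t e_0 + \kappa\sqrt{\eta_t}\,\xi,
      \qquad \xi \sim \mathcal{N}(0, I),
\end{aligned}
```

The last line uses the telescoping sum $\sum_{s=1}^{t}\alpha_s = \eta_t$ (with $\alpha_1 = \eta_1$) and the fact that independent Gaussian terms add their variances, $\kappa^2 \sum_{s=1}^{t}\alpha_s = \kappa^2 \eta_t$.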
The design of the transition distribution in Eq. (1) is guided by two fundamental principles. First, regarding the standard deviation $\kappa\sqrt{\alpha_t}$, its purpose is to ensure a smooth and controlled transition between consecutive states $x_t$ and $x_{t-1}$. This smoothness is guaranteed because the expected distance between $x_t$ and $x_{t-1}$ is bounded by $\sqrt{\alpha_t}$, assuming image data is normalized to $[0, 1]$. Specifically,
$$ \max\left[(x_0 + \eta_t e_0) - (x_0 + \eta_{t-1} e_0)\right] = \max\left[\alpha_t e_0\right] < \alpha_t < \sqrt{\alpha_t}, \tag{3} $$
where $\max[\cdot]$ denotes the pixel-wise maximum operation. The hyper-parameter $\kappa$ adds further flexibility to this design. Second, the mean parameter, $x_{t-1} + \alpha_t e_0$, is chosen to induce the marginal distribution shown in Eq. (2). Consequently, the marginal distributions of $x_1$ and $x_T$ converge to $\delta_{x_0}(\cdot)$ (the Dirac distribution centered at $x_0$) and $\mathcal{N}(\cdot;\ y_0, \kappa^2 I)$, respectively, which serve as approximate distributions for the HR and LR images. This construction of the Markov chain enables the SR task to be tackled efficiently through inverse sampling, conditioned on the LR image $y_0$.
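The forward process can be sketched in a few lines of NumPy. This is an illustrative sketch only: `q_sample` is a hypothetical helper name, the random arrays stand in for real images, and the linear `etas` sequence is a placeholder rather than the schedule of Eq. (9).

```python
import numpy as np

def q_sample(x0, y0, eta_t, kappa, rng):
    """Draw x_t ~ N(x0 + eta_t * e0, kappa^2 * eta_t * I), i.e. the marginal in Eq. (2)."""
    e0 = y0 - x0                                   # residual between LR and HR images
    noise = rng.standard_normal(x0.shape)
    return x0 + eta_t * e0 + kappa * np.sqrt(eta_t) * noise

rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=(8, 8))            # stand-in HR image in [0, 1]
y0 = rng.uniform(0.0, 1.0, size=(8, 8))            # stand-in pre-upsampled LR image
etas = np.linspace(0.001, 0.999, 15)               # placeholder monotone sequence
alphas = np.diff(etas, prepend=0.0)                # alpha_1 = eta_1, alpha_t = eta_t - eta_{t-1}

# Chain the mean of the one-step kernel in Eq. (1); it should land on x0 + eta_T * e0.
mean = x0.copy()
for a in alphas:
    mean = mean + a * (y0 - x0)

x_mid = q_sample(x0, y0, etas[7], kappa=2.0, rng=rng)   # a noisy intermediate state
```

Because Eq. (2) is available in closed form, a training sample at any timestep can be drawn in one shot without simulating the chain step by step.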
Reverse Process. The primary objective of the reverse process is to accurately estimate the posterior distribution $p(x_0 \mid y_0)$ using the following formulation:
$$ p(x_0 \mid y_0) = \int p(x_T \mid y_0) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, y_0)\, \mathrm{d}x_{1:T}, \tag{4} $$
where $p(x_T \mid y_0) \approx \mathcal{N}(x_T \mid y_0, \kappa^2 I)$, and $p_\theta(x_{t-1} \mid x_t, y_0)$ represents the inverse transition kernel from $x_t$ to $x_{t-1}$, parameterized by a learnable set of parameters $\theta$. Consistent with most diffusion model literature [1,2,8], we assume $p_\theta(x_{t-1} \mid x_t, y_0) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, y_0, t),\ \Sigma_\theta(x_t, y_0, t)\right)$. The optimization of $\theta$ is achieved by minimizing the negative evidence lower bound (ELBO), specifically:
$$ \min_\theta \sum_t D_{\mathrm{KL}}\left[ q(x_{t-1} \mid x_t, x_0, y_0) \,\|\, p_\theta(x_{t-1} \mid x_t, y_0) \right], \tag{5} $$
where $D_{\mathrm{KL}}[\cdot \| \cdot]$ denotes the Kullback-Leibler (KL) divergence. For a more comprehensive mathematical treatment, readers are referred to Sohl-Dickstein et al. [1] and Ho et al. [2].
By combining Eq. (1) and Eq. (2), the target distribution $q(x_{t-1} \mid x_t, x_0, y_0)$ in Eq. (5) becomes analytically tractable and can be explicitly expressed as:
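$$ q(x_{t-1} \mid x_t, x_0, y_0) = \mathcal{N}\left(x_{t-1} \,\Big|\, \frac{\eta_{t-1}}{\eta_t} x_t + \frac{\alpha_t}{\eta_t} x_0,\ \kappa^2 \frac{\eta_{t-1}}{\eta_t} \alpha_t I\right). \tag{6} $$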
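The posterior can be checked with a brief Bayes-rule sketch. This is a reconstruction from Eqs. (1) and (2), not necessarily the route taken in the supplementary material; $\sigma_t^2$ and $\mu_t$ are shorthand introduced here for the posterior variance and mean.

```latex
\begin{aligned}
q(x_{t-1} \mid x_t, x_0, y_0)
 &\propto q(x_t \mid x_{t-1}, y_0)\, q(x_{t-1} \mid x_0, y_0), \\
\frac{1}{\sigma_t^2}
 &= \frac{1}{\kappa^2\alpha_t} + \frac{1}{\kappa^2\eta_{t-1}}
  = \frac{\eta_t}{\kappa^2 \alpha_t \eta_{t-1}}
 \ \Longrightarrow\ 
 \sigma_t^2 = \kappa^2 \frac{\eta_{t-1}}{\eta_t}\alpha_t, \\
\mu_t
 &= \sigma_t^2\left[\frac{x_t - \alpha_t e_0}{\kappa^2\alpha_t}
    + \frac{x_0 + \eta_{t-1} e_0}{\kappa^2\eta_{t-1}}\right]
  = \frac{\eta_{t-1}}{\eta_t}\,(x_t - \alpha_t e_0)
    + \frac{\alpha_t}{\eta_t}\,(x_0 + \eta_{t-1} e_0) \\
 &= \frac{\eta_{t-1}}{\eta_t}\, x_t + \frac{\alpha_t}{\eta_t}\, x_0 .
\end{aligned}
```

The two $e_0$ terms cancel because both equal $\alpha_t \eta_{t-1} e_0 / \eta_t$ with opposite signs, which is why the posterior mean does not depend on $y_0$ explicitly.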
The detailed derivation of this expression is provided in the supplementary material. Given that the variance parameter is independent of $x_t$ and $y_0$, we set $\Sigma_\theta(x_t, y_0, t) = \kappa^2 \frac{\eta_{t-1}}{\eta_t} \alpha_t I$. For the mean parameter $\mu_\theta(x_t, y_0, t)$, we adopt the following reparameterization:
$$ \mu_\theta(x_t, y_0, t) = \frac{\eta_{t-1}}{\eta_t} x_t + \frac{\alpha_t}{\eta_t} f_\theta(x_t, y_0, t), \tag{7} $$
where $f_\theta$ is a deep neural network, parameterized by $\theta$, tasked with predicting $x_0$. Through extensive experimentation with various parameterization forms for $\mu_\theta$, we empirically found that Eq. (7) consistently yields superior stability and performance.
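The reverse process implied by Eqs. (4), (6), and (7) can be sketched as a simple sampling loop. This is a sketch under stated assumptions, not the released implementation: `reverse_step` and `sample` are hypothetical names, `f_theta` is any callable standing in for the trained network, the chain starts from $\mathcal{N}(y_0, \kappa^2 \eta_T I)$, and the convention $\eta_0 = 0$ (consistent with $\alpha_1 = \eta_1$) makes the final step return the network prediction without added noise.

```python
import numpy as np

def reverse_step(x_t, x0_pred, eta_t, eta_prev, kappa, rng):
    """One draw from N(mu, var * I), with mu from Eq. (7) and var from Eq. (6)."""
    alpha_t = eta_t - eta_prev
    mean = (eta_prev / eta_t) * x_t + (alpha_t / eta_t) * x0_pred
    var = kappa**2 * (eta_prev / eta_t) * alpha_t
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def sample(y0, f_theta, etas, kappa, rng):
    """Run the chain from a noisy version of y0 at t = T down to an estimate of x0."""
    T = len(etas)
    x = y0 + kappa * np.sqrt(etas[-1]) * rng.standard_normal(y0.shape)
    for t in range(T, 0, -1):
        eta_t = etas[t - 1]
        eta_prev = etas[t - 2] if t > 1 else 0.0   # assumed convention: eta_0 = 0
        x = reverse_step(x, f_theta(x, y0, t), eta_t, eta_prev, kappa, rng)
    return x

# Toy check with a hypothetical "oracle" predictor that always returns the true x0.
rng = np.random.default_rng(0)
x0_true = rng.uniform(size=(4, 4))
y0 = rng.uniform(size=(4, 4))
etas = np.linspace(0.001, 0.999, 15)               # placeholder sequence, not Eq. (9)
oracle = lambda x_t, y, t: x0_true
x0_hat = sample(y0, oracle, etas, kappa=2.0, rng=rng)
```

With a perfect predictor the loop recovers `x0_true` exactly, since the last step has zero variance and its mean collapses to the prediction.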
Based on Eq. (7), the objective function in Eq. (5) can be simplified into a more practical form:
$$ \min_\theta \sum_t w_t \left\| f_\theta(x_t, y_0, t) - x_0 \right\|_2^2, \tag{8} $$
where the weight term is defined as:
$$ w_t = \frac{\alpha_t}{2\kappa^2 \eta_t \eta_{t-1}}. $$
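As a quick consistency check on this weight (a sketch, not the authors' derivation): the two Gaussians in Eq. (5) share the covariance of Eq. (6), so their KL divergence reduces to a scaled squared distance between the means of Eqs. (6) and (7). Here $\sigma_t^2$ and $\mu_q$ are shorthand introduced for the shared variance and the mean in Eq. (6).

```latex
\begin{aligned}
D_{\mathrm{KL}}\left[q \,\|\, p_\theta\right]
 &= \frac{\|\mu_q - \mu_\theta\|_2^2}{2\sigma_t^2},
 \qquad \sigma_t^2 = \kappa^2\frac{\eta_{t-1}}{\eta_t}\alpha_t, \\
\mu_q - \mu_\theta
 &= \frac{\alpha_t}{\eta_t}\left(x_0 - f_\theta(x_t, y_0, t)\right), \\
D_{\mathrm{KL}}\left[q \,\|\, p_\theta\right]
 &= \frac{\alpha_t^2/\eta_t^2}{2\kappa^2\,\eta_{t-1}\alpha_t/\eta_t}
    \left\|f_\theta(x_t, y_0, t) - x_0\right\|_2^2
  = \frac{\alpha_t}{2\kappa^2\eta_t\eta_{t-1}}
    \left\|f_\theta(x_t, y_0, t) - x_0\right\|_2^2 .
\end{aligned}
```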
In practice, we empirically observe that omitting the weight $w_t$ leads to a notable improvement in performance, a finding consistent with the conclusions drawn in Ho et al. [2].
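A minimal sketch of one training-loss evaluation with the weight omitted, alongside the weights of Eq. (8) for comparison. The function names, the uniform sampling of $t$, and the toy predictors are assumptions for illustration; the weights are computed only for $t \geq 2$ because the formula involves $\eta_{t-1}$.

```python
import numpy as np

def unweighted_loss(f_theta, x0, y0, etas, kappa, rng):
    """Squared error ||f_theta(x_t, y0, t) - x0||^2 at one randomly drawn timestep."""
    T = len(etas)
    t = int(rng.integers(1, T + 1))                # assumed: t uniform on {1, ..., T}
    eta_t = etas[t - 1]
    noise = rng.standard_normal(x0.shape)
    x_t = x0 + eta_t * (y0 - x0) + kappa * np.sqrt(eta_t) * noise   # Eq. (2)
    return float(np.sum((f_theta(x_t, y0, t) - x0) ** 2))

def elbo_weights(etas, kappa):
    """w_t = alpha_t / (2 * kappa^2 * eta_t * eta_{t-1}) for t = 2, ..., T."""
    etas = np.asarray(etas, dtype=np.float64)
    alphas = np.diff(etas)                         # alpha_t for t >= 2
    return alphas / (2.0 * kappa**2 * etas[1:] * etas[:-1])

rng = np.random.default_rng(0)
x0 = rng.uniform(size=(4, 4))
y0 = rng.uniform(size=(4, 4))
etas = np.linspace(0.001, 0.999, 15)               # placeholder sequence, not Eq. (9)
loss_perfect = unweighted_loss(lambda x_t, y, t: x0, x0, y0, etas, 2.0, rng)
loss_zero = unweighted_loss(lambda x_t, y, t: np.zeros_like(x_t), x0, y0, etas, 2.0, rng)
weights = elbo_weights(etas, kappa=2.0)
```

Printing `weights` for a given schedule shows how unevenly the ELBO weighting treats different timesteps, which gives some intuition for why dropping it can change training behavior.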
Extension to Latent Space. To further mitigate the computational burden during training and inference, we extend the proposed ResShift model to operate within the latent space of a pre-trained VQGAN [22]. This approach compresses the original image by a factor of four in spatial dimensions. This extension requires no modifications to our model's core architecture, only substituting the original image representations x 0 and y 0 with their corresponding latent codes. An intuitive illustration of this integrated framework is presented in Fig. 2.

Section: Noise Schedule
The proposed ResShift method employs two key components to define its diffusion process noise schedule: a hyper-parameter $\kappa$ and a shifting sequence $\{\eta_t\}_{t=1}^{T}$. Specifically, $\kappa$ governs the overall noise intensity throughout the transition, and its empirical impact on performance is thoroughly discussed in Section 4.2. The subsequent discussion primarily focuses on the construction of the shifting sequence $\{\eta_t\}_{t=1}^{T}$. Equation (2) indicates that the noise level in state $x_t$ is directly proportional to $\sqrt{\eta_t}$, scaled by $\kappa$.
This crucial observation motivates us to design $\sqrt{\eta_t}$ directly, rather than $\eta_t$. Drawing on prior work by Song and Ermon [23], it is established that $\kappa\sqrt{\eta_1}$ must be sufficiently small (e.g., 0.04 in LDM [11]) to ensure that $q(x_1 \mid x_0, y_0) \approx q(x_0)$.
Considering the additional constraint that $\eta_1 \to 0$, we set $\eta_1$ as the minimum value between $(0.04/\kappa)^2$ and 0.001. For the final timestep $T$, we set $\eta_T$ to 0.999, ensuring that $\eta_T \to 1$. For intermediate timesteps, i.e., $t \in [2, T-1]$, we propose a non-uniform geometric schedule for $\sqrt{\eta_t}$ defined as:
$$ \sqrt{\eta_t} = \sqrt{\eta_1} \times b_0^{\beta_t}, \quad t = 2, \cdots, T-1, \tag{9} $$
where
$$ \beta_t = \left(\frac{t-1}{T-1}\right)^{p} \times (T-1), \quad b_0 = \exp\left[\frac{1}{2(T-1)} \log\frac{\eta_T}{\eta_1}\right]. \tag{10} $$
Note that the specific choices for $\beta_t$ and $b_0$ are derived from the assumptions that $\beta_1 = 0$, $\beta_T = T-1$, and $\sqrt{\eta_T} = \sqrt{\eta_1} \times b_0^{T-1}$. The hyper-parameter $p$ plays a vital role in controlling the growth rate of $\sqrt{\eta_t}$, as visually demonstrated in Fig. 3(h).
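The schedule of Eqs. (9) and (10) is short enough to write out directly. The sketch below is illustrative: the function name `eta_schedule` and its argument names are assumptions, and the defaults $T = 15$, $p = 0.3$, $\kappa = 2.0$ are the values reported for the final model later in the paper. Since the formula reproduces $\eta_1$ at $t = 1$ and $\eta_T$ at $t = T$ by construction, it is evaluated over all timesteps at once.

```python
import numpy as np

def eta_schedule(T=15, p=0.3, kappa=2.0, eta_T=0.999, min_noise=0.04):
    """Return eta_1, ..., eta_T following Eqs. (9)-(10)."""
    eta_1 = min((min_noise / kappa) ** 2, 0.001)            # small noise at the first step
    t = np.arange(1, T + 1)
    beta = ((t - 1) / (T - 1)) ** p * (T - 1)               # beta_1 = 0, beta_T = T - 1
    b0 = np.exp(np.log(eta_T / eta_1) / (2 * (T - 1)))      # geometric base of Eq. (10)
    sqrt_eta = np.sqrt(eta_1) * b0 ** beta                  # Eq. (9), endpoints included
    return sqrt_eta ** 2

etas = eta_schedule()
alphas = np.diff(etas, prepend=0.0)                         # alpha_1 = eta_1
```

Changing `p` while keeping the other arguments fixed alters how quickly `etas` grows between its fixed endpoints, which is the growth-rate effect the text attributes to $p$.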
The proposed noise schedule offers significant flexibility across three key dimensions. First, for small values of $\kappa$, the final state $x_T$ converges to a subtle perturbation around the LR image, as depicted in Fig. 3(c)-(d). This design, which deviates from the conventional corruption ending in pure Gaussian noise, substantially shortens the effective length of the Markov chain, thereby markedly improving inference efficiency. Second, the hyper-parameter $p$ provides precise control over the residual shifting speed, enabling an effective fidelity-realism trade-off in the SR results, as thoroughly analyzed in Section 4.2. Third, by setting $\kappa = 40$ and $p = 0.8$, our method can achieve a diffusion process remarkably similar to that of LDM [11]. This similarity is visually confirmed by the diffusion process results presented in Fig. 3(e)-(f) and further quantitatively supported by comparisons of relative noise strength, as shown in Fig. 3(g).

Section: Related Work
Diffusion Model. Originating from non-equilibrium statistical physics, diffusion models were first introduced by Sohl-Dickstein et al. [1] to effectively model complex data distributions. Ho et al. [2] subsequently established a crucial connection between diffusion models and denoising score matching. Later, Song et al. [8] proposed a unified framework for diffusion models from the perspective of stochastic differential equations (SDEs). Thanks to their robust theoretical foundations, diffusion models have achieved impressive success in generating diverse data modalities, including images [3,11], audio [24], graphs [25], and 3D shapes [26].
Image Super-Resolution. Traditional image SR methods predominantly focused on developing more sophisticated image priors based on expert knowledge, such as non-local similarity [27], low-rankness [28], and sparsity [29,30]. With the advent of deep learning (DL), Dong et al. [31] pioneered the field with SRCNN, demonstrating the efficacy of deep neural networks for SR. Since then, DL-based SR methods have rapidly dominated research, exploring various aspects including novel network architectures [32,33,34,35], advanced image priors [36,37,38,39], deep unfolding techniques [40,41,42], and realistic degradation models [18,19,43,44].
Recently, the application of diffusion models to SR has garnered significant attention. A common approach involves concatenating the LR image with noise at each step and retraining a diffusion model from scratch [10,11,45]. Another prevalent strategy is to leverage an unconditional pre-trained diffusion model as a prior, integrating additional constraints to guide its reverse process [7,12,13,46]. Both strategies typically require hundreds or even thousands of sampling steps to generate a perceptually realistic HR image. While various acceleration algorithms [15,16,17] have been proposed, they often lead to a compromise in performance, resulting in perceptually blurry outputs. Our work addresses this fundamental trade-off by designing a more inherently efficient diffusion model, as thoroughly detailed in Section 2.
Remark. Several concurrent works [47,48,49] also explore iterative restoration paradigms for SR. Despite sharing a similar motivation, our work distinguishes itself by adopting a unique mathematical formulation. Delbracio and Milanfar [47] utilized Inversion by Direct Iteration (InDI), while Luo et al. [48] and Liu et al. [49] formulated the process using SDEs. In contrast, this paper designs a discrete Markov chain to precisely depict the transition between HR and LR images, offering a more intuitive and computationally efficient solution to the SR problem.

Section: Experiments
This section provides a comprehensive empirical analysis of the proposed ResShift model, presenting extensive experimental results to validate its effectiveness across one synthetic dataset and two real-world datasets. Consistent with prior works [18,19], our investigation specifically targets the more challenging ×4 SR task. Due to page limitations, additional experimental results and details are deferred to the supplementary material.

Section: Experimental Setup
Training Details. High-resolution (HR) images, each with a resolution of 256 × 256, are randomly cropped from the ImageNet [50] training set, following the methodology of LDM [11]. Low-resolution (LR) images are synthesized using the robust degradation pipeline established by RealESRGAN [19]. ResShift is trained using the Adam [51] optimizer with default PyTorch [52] settings and a mini-batch size of 64. We employ a fixed learning rate of 5e-5 and update the weight parameters for 500K iterations. For the network architecture, we adopt the UNet structure commonly used in DDPM [2]. To enhance ResShift's robustness to arbitrary image resolutions, we replace the standard self-attention layers within the UNet with advanced Swin Transformer [53] blocks.
Testing Datasets. We construct a synthetic testing dataset comprising 3000 images randomly selected from the ImageNet [50] validation set. These LR images are generated based on the widely-used degradation model: y = (x * k) ↓ +n, where k denotes the blurring kernel, n represents the noise, and y and x are the LR and HR images, respectively. To provide a comprehensive evaluation of ResShift's performance, we incorporate more intricate types of blurring kernels, downsampling operators, and noise types. Detailed settings for these degradations can be found in the supplementary material. Notably, we chose HR images from ImageNet [50] over prevalent SR datasets like Set5 [54], Set14 [55], and Urban100 [56]. This decision is motivated by the fact that these smaller datasets contain a limited number of source images, which is insufficient for thoroughly evaluating the performance of various methods under diverse degradation types. For convenience, we refer to this dataset as ImageNet-Test.
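The basic degradation model $y = (x * k)\downarrow + n$ can be sketched as follows. This covers only the simplest instance and is not the test-set pipeline, whose kernels, downsamplers, and noise types are described in the supplementary material. The isotropic Gaussian kernel, strided subsampling, additive Gaussian noise, and all function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(size=7, sigma=1.5):
    """Normalized isotropic Gaussian blur kernel of shape (size, size)."""
    ax = np.arange(size) - (size - 1) / 2
    g = np.exp(-0.5 * (ax / sigma) ** 2)
    k = np.outer(g, g)
    return k / k.sum()

def degrade(x, k, scale=4, noise_std=0.01, rng=None):
    """Blur a 2-D image x with k, subsample by `scale`, then add Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    r = k.shape[0] // 2
    xp = np.pad(x, r, mode="reflect")
    blurred = np.zeros_like(x, dtype=np.float64)
    # Sliding-window sum; equals convolution here because the kernel is symmetric.
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            blurred += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    lr = blurred[::scale, ::scale]                 # strided subsampling
    return lr + noise_std * rng.standard_normal(lr.shape)

x = np.ones((16, 16))                              # constant toy HR image
y = degrade(x, gaussian_kernel(), scale=4, noise_std=0.0, rng=np.random.default_rng(0))
```

A constant image is a convenient sanity check: blurring with a normalized kernel leaves it unchanged, so only the resolution drops.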
Two real-world datasets are utilized to assess the practical efficacy of ResShift. The first is RealSR [57], which contains 100 real-world images captured by Canon 5D3 and Nikon D810 cameras. Additionally, we curate a second real-world dataset named RealSet65. This dataset includes 35 LR images frequently employed in recent literature [19,58,59,60,61], supplemented by 30 images we collected independently from the internet.
Compared Methods. We conduct a rigorous evaluation of ResShift against seven leading state-of-the-art (SotA) SR methods: ESRGAN [62], RealSR-JPEG [63], BSRGAN [18], RealESRGAN [19], SwinIR [20], DASR [21], and LDM [11]. It is important to note that LDM is a diffusion-based method trained with 1,000 diffusion steps. For a fair comparison, we accelerate LDM to match ResShift's sampling step count using DDIM [16] and denote this variant as "LDM-A," where "A" specifies the number of inference steps. The hyper-parameter η in DDIM is set to 1, as this value empirically yields the most perceptually realistic recovered images.
Metrics. The performance of various methods is quantitatively assessed using five distinct metrics: PSNR, SSIM [64], LPIPS [65], MUSIQ [66], and CLIPIQA [67]. It is crucial to highlight that MUSIQ and CLIPIQA are non-reference metrics specifically designed to evaluate the perceptual realism of images. CLIPIQA, in particular, benefits from the powerful representational capabilities inherited from the CLIP [68] model, which is pre-trained on a massive dataset (i.e., Laion400M [69]), thereby exhibiting strong generalization ability. For real-world datasets, we primarily rely on CLIPIQA and MUSIQ as the key evaluation metrics to compare the performance of different methods, as they better reflect human perception.
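Of the five metrics, PSNR is the only one with a closed form simple enough to show inline. The sketch below assumes images scaled to $[0, 1]$ and computes PSNR over all pixels and channels; the exact evaluation protocol (e.g., color space or border cropping) is not specified here and may differ.

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    diff = np.asarray(x, dtype=np.float64) - np.asarray(y, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")                        # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)
value = psnr(a, b)                                 # MSE = 0.01, so about 20 dB
```

Since PSNR depends only on the mean squared error, it is a pure distortion measure, which is why the paper complements it with perceptual and non-reference metrics.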

Section: Model Analysis
In this section, we present a detailed analysis of ResShift's performance under varying configurations of diffusion steps T and the hyper-parameters p (from Eq. (10)) and κ (from Eq. (1)).
Diffusion Steps T and Hyper-parameter p. The proposed transition distribution in Eq. (1) is specifically designed to significantly reduce the required number of diffusion steps T in the Markov chain. Furthermore, the hyper-parameter p offers flexible control over the speed at which the residual shifts during the diffusion process. Table 1 summarizes ResShift's performance on the ImageNet-Test dataset under different configurations of T and p. The results clearly indicate that both T and p facilitate a trade-off between fidelity (measured by reference metrics such as PSNR, SSIM, and LPIPS) and realism (measured by non-reference metrics including CLIPIQA and MUSIQ) of the super-resolved outputs. For instance, as p increases, reference metrics generally improve, while non-reference metrics tend to deteriorate. Moreover, visual comparisons in Fig. 4 illustrate that a larger value of p can suppress the model's ability to hallucinate finer image details, leading to perceptually blurrier results.
Hyper-parameter κ. Equation (2) elucidates that κ is the dominant factor determining the noise strength in state $x_t$. We report the influence of κ on ResShift's performance in Table 1. Combined with the visualizations in Fig. 4, our findings indicate that both excessively large and excessively small values of κ tend to over-smooth the recovered results, despite potentially yielding favorable PSNR and SSIM metrics. We observe that when κ is within the range of [1.0, 2.0], our method achieves the most realistic perceptual quality, as evidenced by CLIPIQA and MUSIQ scores, which is highly desirable for real-world applications. Consequently, we set κ to 2.0 for all experiments in this work.
Efficiency Comparison. To optimize inference efficiency, it is crucial to limit the number of diffusion steps T. However, reducing T can decrease the realism of the restored HR images. To balance the two, the hyper-parameter p can be set to a relatively small value. Therefore, we configure our final model, named ResShift, with T = 15 and p = 0.3. Table 2 compares the efficiency and performance of ResShift against the state-of-the-art (SotA) LDM [11] approach and three other prominent GAN-based methods on the ImageNet-Test dataset. The results show that ResShift outperforms LDM [11] in terms of PSNR and LPIPS [65], while achieving a fourfold improvement in computational efficiency compared to LDM-100. Although ResShift alleviates the efficiency bottleneck of diffusion-based SR, its iterative sampling mechanism still results in slower inference than current GAN-based methods. Addressing this limitation through further optimization of the proposed method remains an important direction for our future work.
Perception-Distortion Trade-off. The well-known perception-distortion trade-off [70] is a fundamental phenomenon in the field of SR. Specifically, enhancing the generative capability of a restoration model—for instance, by increasing the sampling steps for a diffusion-based method or amplifying the weight of the adversarial loss for a GAN-based method—typically leads to a deterioration in fidelity preservation while simultaneously improving the perceptual authenticity of the restored images. This occurs primarily because models with powerful generative capabilities tend to hallucinate more high-frequency image structures, which, while visually appealing, may deviate from the underlying ground truth. To provide a comprehensive comparison between our ResShift and the current SotA diffusion-based method LDM, we plot their perception-distortion curves in Fig. 7. Here, perception and distortion are measured by LPIPS and mean squared error (MSE), respectively. This plot effectively illustrates the perception quality and reconstruction fidelity of ResShift and LDM across varying numbers of diffusion steps (i.e., 10, 15, 20, 30, 40, and 50). As clearly observed, the perception-distortion curve of our ResShift consistently lies beneath that of LDM, indicating its superior capacity in achieving a more favorable balance between perception and distortion.

Section: Evaluation on Synthetic Data
We present a comparative analysis of the proposed ResShift method against recent state-of-the-art (SotA) approaches on the ImageNet-Test dataset. The quantitative results are summarized in Table 3, and qualitative comparisons are provided in Fig. 5. Based on this evaluation, several conclusions can be drawn: i) ResShift exhibits superior or at least comparable performance across all five evaluation metrics, supporting the effectiveness of the proposed method.
ii) The notably higher PSNR and SSIM values achieved by ResShift underscore its exceptional capacity to better preserve fidelity to the ground truth images. This significant advantage primarily stems from our meticulously designed diffusion model, which initiates from a subtle perturbation of the LR image, a departure from the conventional assumption of white Gaussian noise in methods like LDM. iii) When considering perceptual metrics such as LPIPS and CLIPIQA, which are specifically designed to gauge the perceptual quality and realism of the recovered images, ResShift also demonstrates clear superiority over existing methods. Furthermore, in terms of MUSIQ, our approach achieves competitive performance comparable to recent SotA methods. In summary, the proposed ResShift exhibits remarkable capabilities in generating more realistic results while simultaneously preserving high fidelity, a balance of paramount importance for the challenging task of SR.

Section: Evaluation on Real-World Data
Table 4 presents a comparative evaluation using CLIPIQA [67] and MUSIQ [66] metrics for various methods on two distinct real-world datasets. It is important to highlight that CLIPIQA, leveraging the powerful representative capabilities inherited from the CLIP model, demonstrates stable and robust performance in assessing the perceptual quality of natural images. The results in Table 4 unequivocally show that the proposed ResShift significantly surpasses existing methods in CLIPIQA scores, indicating that the restored outputs of ResShift better align with human visual and perceptive systems. In the context of MUSIQ evaluation, ResShift achieves competitive performance when compared to leading SotA methods, including BSRGAN [18], SwinIR [20], and RealESRGAN [19]. Collectively, our method demonstrates a promising capability in effectively addressing the complexities of real-world SR problems.
We provide visual comparisons of four representative real-world examples in Fig. 6, with additional examples available in the supplementary material. These examples span diverse scenarios, including comic art, text, facial images, and natural scenes, ensuring a comprehensive evaluation. A striking observation is that ResShift consistently produces more naturalistic image structures, as clearly evidenced by the intricate patterns on the beam in the third example and the nuanced details in the eyes of the person in the fourth example. We note that the recovered results of LDM appear excessively smooth when its inference steps are compressed to match ResShift's 15 steps, representing a significant deviation from its original training procedure's 1,000 steps. While other GAN-based methods may also succeed in hallucinating plausible structures to some extent, their outputs are frequently marred by noticeable artifacts.

Section: Conclusion
In this work, we introduced ResShift, an efficient and effective diffusion model specifically designed for image super-resolution (SR). Unlike conventional diffusion-based SR methods that typically demand a large number of iterative steps to achieve satisfactory results, ResShift significantly reduces this requirement. Extensive experiments conducted on both synthetic and challenging real-world datasets have consistently demonstrated the superior performance and efficiency of our proposed method. We are confident that our work represents a significant step forward and will inspire the development of even more efficient and highly effective diffusion models for tackling the complex SR problem.

Section: Acknowledgement
This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).


References:
[b0] Jascha Sohl-Dickstein; Eric Weiss; Niru Maheswaranathan; Surya Ganguli (2015). Deep unsupervised learning using nonequilibrium thermodynamics. PMLR
[b1] Jonathan Ho; Ajay Jain; Pieter Abbeel (2020). Denoising diffusion probabilistic models.
[b2] Prafulla Dhariwal; Alexander Nichol (2021). Diffusion models beat gans on image synthesis.
[b3] Chenlin Meng; Yutong He; Yang Song; Jiaming Song; Jiajun Wu; Jun-Yan Zhu; Stefano Ermon (2021). Sdedit: Guided image synthesis and editing with stochastic differential equations.
[b4] Omri Avrahami; Dani Lischinski; Ohad Fried (2022). Blended diffusion for text-driven editing of natural images.
[b5] Andreas Lugmayr; Martin Danelljan; Andres Romero; Fisher Yu; Radu Timofte; Luc Van Gool (2022-06). Repaint: Inpainting using denoising diffusion probabilistic models.
[b6] Hyungjin Chung; Byeongsu Sim; Jong Chul; Ye  (2022). Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction.
[b7] Yang Song; Jascha Sohl-Dickstein; P Diederik; Abhishek Kingma; Stefano Kumar; Ben Ermon;  Poole (2021). Score-based generative modeling through stochastic differential equations.
[b8] Chitwan Saharia; William Chan; Huiwen Chang; Chris Lee; Jonathan Ho; Tim Salimans; David Fleet; Mohammad Norouzi (2022). Palette: Image-to-image diffusion models.
[b9] Chitwan Saharia; Jonathan Ho; William Chan; Tim Salimans; David J Fleet; Mohammad Norouzi (2022). Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
[b10] Robin Rombach; Andreas Blattmann; Dominik Lorenz; Patrick Esser; Björn Ommer (2022). High-resolution image synthesis with latent diffusion models.
[b11] Jooyoung Choi; Sungwon Kim; Yonghyun Jeong; Youngjune Gwon; Sungroh Yoon (2021). Ilvr: Conditioning method for denoising diffusion probabilistic models.
[b12] Zongsheng Yue; Chen Change Loy (2022). Difface: Blind face restoration with diffused error contraction.
[b13] Jianyi Wang; Zongsheng Yue; Shangchen Zhou; Kelvin C K Chan; Chen Change Loy (2023). Exploiting diffusion prior for real-world image super-resolution.
[b14] Alexander Quinn Nichol; Prafulla Dhariwal (2021). Improved denoising diffusion probabilistic models. PMLR
[b15] Jiaming Song; Chenlin Meng; Stefano Ermon (2021). Denoising diffusion implicit models.
[b16] Cheng Lu; Yuhao Zhou; Fan Bao; Jianfei Chen; Chongxuan Li; Jun Zhu (2022). DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps.
[b17] Kai Zhang; Jingyun Liang; Luc Van Gool; Radu Timofte (2021). Designing a practical degradation model for deep blind image super-resolution.
[b18] Xintao Wang; Liangbin Xie; Chao Dong; Ying Shan (2021). Real-esrgan: Training real-world blind super-resolution with pure synthetic data.
[b19] Jingyun Liang; Jiezhang Cao; Guolei Sun; Kai Zhang; Luc Van Gool; Radu Timofte (2021). Swinir: Image restoration using swin transformer.
[b20] Jie Liang; Hui Zeng; Lei Zhang (2022). Efficient and degradation-adaptive network for real-world image super-resolution.
[b21] Patrick Esser; Robin Rombach; Bjorn Ommer (2021). Taming transformers for high-resolution image synthesis.
[b22] Yang Song; Stefano Ermon (2019). Generative modeling by estimating gradients of the data distribution.
[b23] Nanxin Chen; Yu Zhang; Heiga Zen; Ron J Weiss; Mohammad Norouzi; William Chan (2020). WaveGrad: estimating gradients for waveform generation.
[b24] Chenhao Niu; Yang Song; Jiaming Song; Shengjia Zhao; Aditya Grover; Stefano Ermon (2020). Permutation invariant graph generation via score-based generative modeling.
[b25] Ruojin Cai; Guandao Yang; Hadar Averbuch-Elor; Zekun Hao; Serge Belongie; Noah Snavely; Bharath Hariharan (2020). Learning gradient fields for shape generation.
[b26] Weisheng Dong; Lei Zhang; Guangming Shi; Xin Li (2012). Nonlocally centralized sparse representation for image restoration. IEEE Transactions on Image Processing (TIP)
[b27] Shuhang Gu; Qi Xie; Deyu Meng; Wangmeng Zuo; Xiangchu Feng; Lei Zhang (2017). Weighted nuclear norm minimization and its applications to low level vision. International Journal of Computer Vision (IJCV)
[b28] Weisheng Dong; Lei Zhang; Guangming Shi; Xiaolin Wu (2011). Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing (TIP)
[b29] Shuhang Gu; Wangmeng Zuo; Qi Xie; Deyu Meng; Xiangchu Feng; Lei Zhang (2015). Convolutional sparse coding for image super-resolution.
[b30] Chao Dong; Chen Change Loy; Kaiming He; Xiaoou Tang (2015). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
[b31] Wenzhe Shi; Jose Caballero; Ferenc Huszár; Johannes Totz; Andrew P Aitken; Rob Bishop; Daniel Rueckert; Zehan Wang (2016). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network.
[b32] Kai Zhang; Wangmeng Zuo; Yunjin Chen; Deyu Meng; Lei Zhang (2017). Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing (TIP)
[b33] Wei-Sheng Lai; Jia-Bin Huang; Narendra Ahuja; Ming-Hsuan Yang (2017). Deep laplacian pyramid networks for fast and accurate super-resolution.
[b34] Muhammad Haris; Gregory Shakhnarovich; Norimichi Ukita (2018). Deep back-projection networks for super-resolution.
[b35] Jingyun Liang; Kai Zhang; Shuhang Gu; Luc Van Gool; Radu Timofte (2021). Flow-based kernel prior with application to blind super-resolution.
[b36] Kelvin C K Chan; Xintao Wang; Xiangyu Xu; Jinwei Gu; Chen Change Loy (2021). GLEAN: Generative latent bank for large-factor image super-resolution.
[b37] Xingang Pan; Xiaohang Zhan; Bo Dai; Dahua Lin; Chen Change Loy; Ping Luo (2021). Exploiting deep generative prior for versatile image restoration and manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
[b38] Zongsheng Yue; Qian Zhao; Jianwen Xie; Lei Zhang; Deyu Meng; Kwan-Yee K Wong (2022). Blind image super-resolution with elaborate degradation modeling on noise and kernel.
[b39] Kai Zhang; Wangmeng Zuo; Lei Zhang (2019). Deep plug-and-play super-resolution for arbitrary blur kernels.
[b40] Kai Zhang; Luc Van Gool; Radu Timofte (2020). Deep unfolding network for image super-resolution.
[b41] Jiahong Fu; Hong Wang; Qi Xie; Qian Zhao; Deyu Meng; Zongben Xu (2022). Kxnet: A model-driven deep neural network for blind super-resolution.
[b42] Kai Zhang; Wangmeng Zuo; Lei Zhang (2018). Learning a single convolutional super-resolution network for multiple degradations.
[b43] Chong Mou; Yanze Wu; Xintao Wang; Chao Dong; Jian Zhang; Ying Shan (2022). Metric learning based interactive modulation for real-world super-resolution.
[b44] Haoying Li; Yifan Yang; Meng Chang; Shiqi Chen; Huajun Feng; Zhihai Xu; Qi Li; Yueting Chen (2022). Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing
[b45] Bahjat Kawar; Michael Elad; Stefano Ermon; Jiaming Song (2022). Denoising diffusion restoration models.
[b46] Mauricio Delbracio; Peyman Milanfar (2023). Inversion by direct iteration: An alternative to denoising diffusion for image restoration.
[b47] Ziwei Luo; Fredrik K Gustafsson; Zheng Zhao; Jens Sjölund; Thomas B Schön (2023). Image restoration with mean-reverting stochastic differential equations.
[b48] Guan-Horng Liu; Arash Vahdat; De-An Huang; Evangelos A Theodorou; Weili Nie; Anima Anandkumar (2023). I2SB: Image-to-image Schrödinger bridge.
[b49] Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Li Fei-Fei (2009). Imagenet: A large-scale hierarchical image database.
[b50] Diederik P Kingma; Jimmy Ba (2015). Adam: A method for stochastic optimization.
[b51] Adam Paszke; Sam Gross; Francisco Massa; Adam Lerer; James Bradbury; Gregory Chanan; Trevor Killeen; Zeming Lin; Natalia Gimelshein; Luca Antiga (2019). Pytorch: An imperative style, high-performance deep learning library.
[b52] Ze Liu; Yutong Lin; Yue Cao; Han Hu; Yixuan Wei; Zheng Zhang; Stephen Lin; Baining Guo (2021). Swin transformer: Hierarchical vision transformer using shifted windows.
[b53] Marco Bevilacqua; Aline Roumy; Christine Guillemot; Marie Line Alberi-Morel (2012). Low-complexity single-image super-resolution based on nonnegative neighbor embedding.
[b54] Roman Zeyde; Michael Elad; Matan Protter (2012). On single image scale-up using sparse-representations. Springer
[b55] Jia-Bin Huang; Abhishek Singh; Narendra Ahuja (2015). Single image super-resolution from transformed self-exemplars.
[b56] Jianrui Cai; Hui Zeng; Hongwei Yong; Zisheng Cao; Lei Zhang (2019). Toward real-world single image super-resolution: A new benchmark and a new model.
[b57] David Martin; Charless Fowlkes; Doron Tal; Jitendra Malik (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics.
[b58] Yusuke Matsui; Kota Ito; Yuji Aramaki; Azuma Fujimoto; Toru Ogawa; Toshihiko Yamasaki; Kiyoharu Aizawa (2017). Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications
[b59] Andrey Ignatov; Nikolay Kobyshev; Radu Timofte; Kenneth Vanhoey; Luc Van Gool (2017). Dslr-quality photos on mobile devices with deep convolutional networks.
[b60] Kai Zhang; Wangmeng Zuo; Lei Zhang (2018). Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing (TIP)
[b61] Xintao Wang; Ke Yu; Shixiang Wu; Jinjin Gu; Yihao Liu; Chao Dong; Yu Qiao; Chen Change Loy (2018). Esrgan: Enhanced super-resolution generative adversarial networks.
[b62] Xiaozhong Ji; Yun Cao; Ying Tai; Chengjie Wang; Jilin Li; Feiyue Huang (2020). Real-world super-resolution via kernel estimation and noise injection.
[b63] Zhou Wang; A C Bovik; H R Sheikh; E P Simoncelli (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP)
[b64] Richard Zhang; Phillip Isola; Alexei A Efros; Eli Shechtman; Oliver Wang (2018). The unreasonable effectiveness of deep features as a perceptual metric.
[b65] Junjie Ke; Qifei Wang; Yilin Wang; Peyman Milanfar; Feng Yang (2021). Musiq: Multi-scale image quality transformer.
[b66] Jianyi Wang; Kelvin C K Chan; Chen Change Loy (2023). Exploring clip for assessing the look and feel of images.
[b67] Alec Radford; Jong Wook Kim; Chris Hallacy; Aditya Ramesh; Gabriel Goh; Sandhini Agarwal; Girish Sastry; Amanda Askell; Pamela Mishkin; Jack Clark (2021). Learning transferable visual models from natural language supervision.
[b68] Christoph Schuhmann; Richard Vencu; Romain Beaumont; Robert Kaczmarczyk; Clayton Mullis; Aarush Katta; Theo Coombes; Jenia Jitsev; Aran Komatsuzaki (2021). Laion-400m: Open dataset of clip-filtered 400 million image-text pairs.
[b69] Yochai Blau; Tomer Michaeli (2018-06). The perception-distortion tradeoff.

Figures:
Figure fig_1: 2
Type: figure
Caption: Figure 2: Overview of the proposed method. It builds a Markov chain between the HR/LR image pair by shifting their residual.
Data: 

Figure fig_2: 
Type: figure
Caption: Panel labels of Figure 3: (a) HR Image; (b) Zoomed LR; (c) ResShift (κ=1.0, p=0.3, T=15); (d) ResShift (κ=2.0, p=0.3, T=15); (e) ResShift (κ=40, p=0.8, T=1000); (f) Latent Diffusion Model (T=1000); (g); (h). Annotation: Forward Process.
Data: 

Figure fig_3: 3
Type: figure
Caption: Figure 3: Illustration of the proposed noise schedule. (a) HR image. (b) Zoomed LR image. (c)-(d) Diffused images of ResShift at timesteps 1, 3, 5, 7, 9, 12, and 15 under different values of κ, with p = 0.3 and T = 15 fixed. (e)-(f) Diffused images of ResShift with the configuration κ = 40, p = 0.8, T = 1000 and of LDM [11] at timesteps 100, 200, 400, 600, 800, 900, and 1000. (g) The relative noise intensity (vertical axis, measured by 1/λ_snr, where λ_snr denotes the signal-to-noise ratio) of the schedules in (d) and (e) w.r.t. the timesteps (horizontal axis). (h) The shifting speed √η_t (vertical axis) w.r.t. the timesteps (horizontal axis) across various configurations of p. Note that the diffusion processes in this figure are implemented in the latent space; the intermediate results are displayed after decoding back to the image space for ease of visualization.
Data: 

Figure fig_4: 4
Type: figure
Caption: Figure 4: Qualitative comparisons of ResShift under different combinations of (T, p, κ). For example, "(15, 0.3, 2.0)" represents the recovered result with T = 15, p = 0.3, and κ = 2.0. Please zoom in for a better view.
Data: 

Figure fig_6: 5
Type: figure
Caption: Figure 5: Qualitative comparisons of different methods on two synthetic examples of the ImageNet-Test dataset. Please zoom in for a better view.
Data: 

Figure fig_7: 7
Type: figure
Caption: Figure 7: Perception-distortion trade-off of ResShift and LDM. The vertical and horizontal axes represent the strength of the perception and distortion, measured by LPIPS and MSE, respectively.
Data: 

Figure fig_9: 6
Type: figure
Caption: Figure 6: Qualitative comparisons on four real-world examples. Please zoom in for a better view.
Data: 

Figure tab_1: 1
Type: table
Caption: Performance comparison of ResShift on the ImageNet-Test under different configurations.
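Data:
| T | p | κ | PSNR↑ | SSIM↑ | LPIPS↓ | CLIPIQA↑ | MUSIQ↑ |
|---|---|---|---|---|---|---|---|
| 10 | 0.3 | 2.0 | 25.20 | 0.6828 | 0.2517 | 0.5492 | 50.6617 |
| 15 | 0.3 | 2.0 | 25.01 | 0.6769 | 0.2312 | 0.5922 | 53.6596 |
| 30 | 0.3 | 2.0 | 24.52 | 0.6585 | 0.2253 | 0.6273 | 55.7904 |
| 40 | 0.3 | 2.0 | 24.29 | 0.6513 | 0.2225 | 0.6468 | 56.8482 |
| 50 | 0.3 | 2.0 | 24.22 | 0.6483 | 0.2212 | 0.6489 | 56.8463 |
| 15 | 0.3 | 2.0 | 25.01 | 0.6769 | 0.2312 | 0.5922 | 53.6596 |
| 15 | 0.5 | 2.0 | 25.05 | 0.6745 | 0.2387 | 0.5816 | 52.4475 |
| 15 | 1.0 | 2.0 | 25.12 | 0.6780 | 0.2613 | 0.5314 | 48.4964 |
| 15 | 2.0 | 2.0 | 25.32 | 0.6827 | 0.3050 | 0.4601 | 43.3060 |
| 15 | 3.0 | 2.0 | 25.39 | 0.5813 | 0.3432 | 0.4041 | 38.5324 |
| 15 | 0.3 | 0.5 | 24.90 | 0.6709 | 0.2437 | 0.5700 | 50.6101 |
| 15 | 0.3 | 1.0 | 24.84 | 0.6699 | 0.2354 | 0.5914 | 52.9933 |
| 15 | 0.3 | 2.0 | 25.01 | 0.6769 | 0.2312 | 0.5922 | 53.6596 |
| 15 | 0.3 | 8.0 | 25.31 | 0.6858 | 0.2592 | 0.5231 | 49.3182 |
| 15 | 0.3 | 16.0 | 24.46 | 0.6891 | 0.2772 | 0.4898 | 46.9794 |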

Figure tab_2: 2
Type: table
Caption: Efficiency and performance comparisons of ResShift to other methods on the ImageNet-Test dataset. "LDM-A" denotes the results obtained by accelerating the sampling of LDM [11] to "A" steps. Running time is measured on an NVIDIA Tesla V100 GPU on the ×4 (64 → 256) SR task.
Data:
| Metrics | BSRGAN | RealESRGAN | SwinIR | LDM-15 | LDM-30 | LDM-100 | ResShift |
|---|---|---|---|---|---|---|---|
| PSNR↑ | 24.42 | 24.04 | 23.99 | 24.89 | 24.49 | 23.90 | 25.01 |
| LPIPS↓ | 0.259 | 0.254 | 0.238 | 0.269 | 0.248 | 0.244 | 0.231 |
| CLIPIQA↑ | 0.581 | 0.523 | 0.564 | 0.512 | 0.572 | 0.620 | 0.592 |
| Runtime (s) | 0.012 | 0.013 | 0.046 | 0.102 | 0.184 | 0.413 | 0.105 |
| # Parameters (M) | 16.70 | 16.70 | 28.01 | 113.60 | 113.60 | 113.60 | 118.59 |

Figure tab_3: 3
Type: table
Caption: Quantitative results of different methods on the dataset of ImageNet-Test. The best and second best results are highlighted in bold and underline.
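Data:
| Methods | PSNR↑ | SSIM↑ | LPIPS↓ | CLIPIQA↑ | MUSIQ↑ |
|---|---|---|---|---|---|
| ESRGAN [62] | 20.67 | 0.448 | 0.485 | 0.451 | 43.615 |
| RealSR-JPEG [63] | 23.11 | 0.591 | 0.326 | 0.537 | 46.981 |
| BSRGAN [18] | 24.42 | 0.659 | 0.259 | 0.581 | 54.697 |
| SwinIR [20] | 23.99 | 0.667 | 0.238 | 0.564 | 53.790 |
| RealESRGAN [19] | 24.04 | 0.665 | 0.254 | 0.523 | 52.538 |
| DASR [21] | 24.75 | 0.675 | 0.250 | 0.536 | 48.337 |
| LDM-15 [11] | 24.89 | 0.670 | 0.269 | 0.512 | 46.419 |
| ResShift | 25.01 | 0.677 | 0.231 | 0.592 | 53.660 |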

Figure tab_4: 4
Type: table
Caption: Quantitative results of different methods on two real-world datasets. The best and second best results are highlighted in bold and underline.
Data:
| Methods | RealSR CLIPIQA↑ | RealSR MUSIQ↑ | RealSet65 CLIPIQA↑ | RealSet65 MUSIQ↑ |
|---|---|---|---|---|
| ESRGAN [62] | 0.2362 | 29.048 | 0.3739 | 42.369 |
| RealSR-JPEG [63] | 0.3615 | 36.076 | 0.5282 | 50.539 |
| BSRGAN [18] | 0.5439 | 63.586 | 0.6163 | 65.582 |
| SwinIR [20] | 0.4654 | 59.636 | 0.5782 | 63.822 |
| RealESRGAN [19] | 0.4898 | 59.678 | 0.5995 | 63.220 |
| DASR [21] | 0.3629 | 45.825 | 0.4965 | 55.708 |
| LDM-15 [11] | 0.3836 | 49.317 | 0.4274 | 47.488 |
| ResShift | 0.5958 | 59.873 | 0.6537 | 61.330 |

... measured by LPIPS and mean square-error (MSE), respectively. This plot reflects the perception quality and the reconstruction fidelity of ResShift and LDM across varying numbers of diffusion steps, i.e., 10, 15, 20, 30, 40, and 50. As can be observed, the perception-distortion curve of our ResShift consistently resides beneath that of the LDM, indicating its superior capacity in balancing perception and distortion.


Formulas:
Formula formula_0: $q(x_t \mid x_{t-1}, y_0) = \mathcal{N}\left(x_t;\, x_{t-1} + \alpha_t e_0,\, \kappa^2 \alpha_t I\right), \quad t = 1, 2, \cdots, T, \qquad (1)$

Formula formula_1: $q(x_t \mid x_0, y_0) = \mathcal{N}\left(x_t;\, x_0 + \eta_t e_0,\, \kappa^2 \eta_t I\right), \quad t = 1, 2, \cdots, T. \qquad (2)$
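
The marginal in Eq. (2) can be sampled in one shot by reparameterization. The sketch below assumes that its mean is x_0 + η_t e_0 (what unrolling Eq. (1) gives when η_t is the cumulative sum of the α_i), that the residual is e_0 = y_0 - x_0, and that y_0 has already been resized to the shape of x_0. The function name `q_sample` and the flat-list image representation are illustrative choices, not taken from the paper.

```python
import math
import random


def q_sample(x0, y0, eta_t, kappa, rng=random):
    """Draw x_t ~ N(x0 + eta_t * e0, kappa^2 * eta_t * I), with e0 = y0 - x0.

    x0, y0: flat lists of floats of equal length (assumed representation).
    """
    std = kappa * math.sqrt(eta_t)
    return [
        x + eta_t * (y - x) + std * rng.gauss(0.0, 1.0)
        for x, y in zip(x0, y0)
    ]


# Toy 4-pixel example; the values are illustrative only.
x0 = [0.0, 0.25, 0.5, 1.0]
y0 = [0.5, 0.5, 0.5, 0.5]
x_mid = q_sample(x0, y0, eta_t=0.5, kappa=2.0, rng=random.Random(0))
```

With η_t close to 0 the sample stays near x_0, and with η_t close to 1 its mean reaches y_0, which is the sense in which the chain bridges the HR and LR images.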

Formula formula_2: $\max[(x_0 + \eta_t e_0) - (x_0 + \eta_{t-1} e_0)] = \max[\alpha_t e_0] < \alpha_t < \sqrt{\alpha_t}, \qquad (3)$

Formula formula_3: $p(x_0 \mid y_0) = \int p(x_T \mid y_0) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, y_0)\, \mathrm{d}x_{1:T}, \qquad (4)$

Formula formula_4: $\min_\theta \sum_t D_{\mathrm{KL}}\left[ q(x_{t-1} \mid x_t, x_0, y_0) \,\|\, p_\theta(x_{t-1} \mid x_t, y_0) \right], \qquad (5)$

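Formula formula_5: $q(x_{t-1} \mid x_t, x_0, y_0) = \mathcal{N}\left( x_{t-1} \,\middle|\, \frac{\eta_{t-1}}{\eta_t} x_t + \frac{\alpha_t}{\eta_t} x_0,\; \kappa^2 \frac{\eta_{t-1}}{\eta_t} \alpha_t I \right). \qquad (6)$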

Formula formula_6: $\mu_\theta(x_t, y_0, t) = \frac{\eta_{t-1}}{\eta_t} x_t + \frac{\alpha_t}{\eta_t} f_\theta(x_t, y_0, t), \qquad (7)$
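
One reverse step implied by Eqs. (6) and (7) can be sketched as follows: the mean mixes x_t with the network's estimate of x_0, and the variance is the posterior variance of Eq. (6). The sketch assumes α_t = η_t - η_{t-1}; `x0_pred` is a hypothetical stand-in for the output of f_θ(x_t, y_0, t), and the function name and flat-list representation are illustrative.

```python
import math
import random


def reverse_step(x_t, x0_pred, eta_t, eta_prev, kappa, rng=random):
    """Sample x_{t-1} from N(mu, kappa^2 * (eta_prev / eta_t) * alpha_t * I).

    mu = (eta_prev / eta_t) * x_t + (alpha_t / eta_t) * x0_pred, as in Eq. (7).
    Assumes alpha_t = eta_t - eta_prev (not stated in this section).
    """
    alpha_t = eta_t - eta_prev
    coef_xt = eta_prev / eta_t
    coef_x0 = alpha_t / eta_t
    std = kappa * math.sqrt(coef_xt * alpha_t)
    return [
        coef_xt * xt + coef_x0 * xp + std * rng.gauss(0.0, 1.0)
        for xt, xp in zip(x_t, x0_pred)
    ]


# Toy example with a made-up x0 prediction; values are illustrative only.
x_prev = reverse_step([1.0, 2.0], [0.5, -0.5], eta_t=0.5, eta_prev=0.2,
                      kappa=2.0, rng=random.Random(0))
```

The two mean coefficients sum to one, so each step is a convex combination of the current state and the predicted HR image. When η_{t-1} = 0 the step returns the prediction itself with zero variance.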

Formula formula_7: $\min_\theta \sum_t w_t \left\| f_\theta(x_t, y_0, t) - x_0 \right\|_2^2, \qquad (8)$

Formula formula_8: $w_t = \frac{\alpha_t}{2 \kappa^2 \eta_t \eta_{t-1}}.$
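
As a small worked illustration of the weight above, the following sketch evaluates w_t for a given list of η values. It assumes α_t = η_t - η_{t-1} and only covers t ≥ 2, since the expression divides by η_{t-1} and is not evaluated here for the first step. The function name and the toy numbers are hypothetical.

```python
def loss_weights(etas, kappa):
    """Return {t: w_t} for t = 2..T, where etas[t - 1] holds eta_t.

    w_t = alpha_t / (2 * kappa^2 * eta_t * eta_{t-1}),
    with alpha_t = eta_t - eta_{t-1} (assumed).
    """
    weights = {}
    for t in range(2, len(etas) + 1):
        eta_t, eta_prev = etas[t - 1], etas[t - 2]
        alpha_t = eta_t - eta_prev
        weights[t] = alpha_t / (2.0 * kappa ** 2 * eta_t * eta_prev)
    return weights


# Toy schedule with three steps; the values are illustrative only.
w = loss_weights([0.1, 0.4, 0.9], kappa=2.0)
```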

Formula formula_9: $\sqrt{\eta_t} = \sqrt{\eta_1} \times b_0^{\beta_t}, \quad t = 2, \cdots, T-1, \qquad (9)$

Formula formula_10: $\beta_t = \left( \frac{t-1}{T-1} \right)^{p} \times (T-1), \quad b_0 = \exp\left[ \frac{1}{2(T-1)} \log \frac{\eta_T}{\eta_1} \right]. \qquad (10)$

Formula formula_12: $\beta_1 = 0$, $\beta_T = T-1$, and $\sqrt{\eta_T} = \sqrt{\eta_1} \times b_0^{T-1}$
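
The schedule of Eqs. (9) and (10) can be computed directly, as in the minimal sketch below. T = 15 and p = 0.3 follow the configuration reported in the figures and tables; the endpoint values η_1 = 0.001 and η_T = 0.999 are placeholder assumptions, and the relation α_1 = η_1, α_t = η_t - η_{t-1} is likewise assumed rather than stated in this section. The function name is illustrative.

```python
import math


def resshift_eta_schedule(T=15, p=0.3, eta_1=0.001, eta_T=0.999):
    """Return [eta_1, ..., eta_T] from sqrt(eta_t) = sqrt(eta_1) * b0 ** beta_t.

    eta_1 and eta_T are placeholder endpoints (assumptions, not from the text).
    """
    b0 = math.exp(math.log(eta_T / eta_1) / (2 * (T - 1)))
    etas = []
    for t in range(1, T + 1):
        beta_t = ((t - 1) / (T - 1)) ** p * (T - 1)
        sqrt_eta_t = math.sqrt(eta_1) * b0 ** beta_t
        etas.append(sqrt_eta_t ** 2)
    return etas


etas = resshift_eta_schedule()
# Per-step increments, assuming alpha_1 = eta_1 and alpha_t = eta_t - eta_{t-1}.
alphas = [etas[0]] + [etas[i] - etas[i - 1] for i in range(1, len(etas))]
```

Because β_1 = 0 and β_T = T - 1, the loop reproduces the two endpoints without special-casing them. The exponent p only changes how quickly √η_t grows in between, and with p = 1 the growth of √η_t is purely geometric.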
