Title: IMPROVING PROBABILISTIC DIFFUSION MODELS WITH OPTIMAL DIAGONAL COVARIANCE MATCHING

Abstract: Probabilistic diffusion models have achieved remarkable effectiveness across various domains. Typically, sampling from a diffusion model involves using a denoising distribution characterized by a Gaussian with a learned mean and either fixed or learned covariances. In this paper, we leverage the recently proposed covariance moment matching technique and introduce a novel, principled method for learning the diagonal covariance. Unlike traditional data-driven diagonal covariance approximation approaches, our method directly regresses the optimal diagonal analytic covariance using a new, unbiased objective named Optimal Covariance Matching (OCM). This innovative approach significantly reduces the approximation error in covariance prediction. We demonstrate how our method substantially enhances the sampling efficiency, recall rate, and likelihood of commonly used diffusion models, establishing a new state-of-the-art in these metrics.

Section: INTRODUCTION
Diffusion models (Sohl-Dickstein et al., 2015;Ho et al., 2020;Song & Ermon, 2019) have achieved remarkable success in modeling complex data across various domains (Rombach et al., 2022;Li et al., 2022;Poole et al., 2022;Ho et al., 2022;Hoogeboom et al., 2022;Liu et al., 2023). A conventional diffusion model operates in two stages: a forward noising process, indexed by t ∈ [1, T ], which progressively corrupts the data distribution into a Gaussian distribution via Gaussian convolution, and a reverse denoising process, which generates images by gradually transforming Gaussian noise back into coherent data samples. In traditional diffusion models, the generation process typically predicts only the mean of the denoising distribution while using a fixed, pre-defined variance (Ho et al., 2020). This approach often requires a very large number of steps, T , to produce high-quality, diverse samples or to achieve reasonable model likelihoods, leading to inefficiencies during inference.
To address this inefficiency, several works have proposed estimating the diagonal covariance of denoising distributions rather than relying on pre-defined variance values. For instance, Bao et al. (2022b) introduce an analytical form of isotropic, state-independent covariance that can be estimated from the data. This analytical solution is optimal under the constraints of isotropy and state independence. A more flexible approach involves learning a state-dependent diagonal covariance. Nichol & Dhariwal (2021) propose learning this form by optimizing the variational lower bound (VLB). Bao et al. (2022a) explore learning a state-dependent diagonal covariance directly from data, also examining its analytical form. These methods have demonstrated improved image quality with fewer denoising steps. Learning the covariance through VLB optimization (Nichol & Dhariwal, 2021) has become a widely adopted strategy in state-of-the-art image and video generative models.
Building on this line of research, the goal of our paper is to develop an improved denoising covariance strategy that enhances both generation quality and likelihood evaluation while reducing the total number of time steps. Recently, Zhang et al. (2024b) derived the optimal state-dependent full covariance for the denoising distribution. While this method offers greater flexibility than state-dependent diagonal covariance, it requires $O(D^2)$ storage for the Hessian matrix and $O(D)$ network evaluations per denoising step, where $D$ is the data dimension. This makes it impractical for high-dimensional applications, such as image generation. To address this limitation, we propose a novel, unbiased covariance matching objective that enables a neural network to match the diagonal of the optimal state-dependent covariance. Unlike previous methods (Bao et al., 2022a; Nichol & Dhariwal, 2021), which learn the diagonal covariance directly from the data, our approach estimates the diagonal covariance from the learned score function. We show that this method significantly reduces covariance estimation errors compared to existing techniques. Moreover, we demonstrate that our approach can be applied to both Markovian (DDPM) and non-Markovian (DDIM) diffusion models, as well as latent diffusion models. This results in improvements in generation quality, recall, and likelihood evaluation, while also reducing the number of function evaluations (NFEs).

Section: BACKGROUND OF PROBABILISTIC DIFFUSION MODELS
We first introduce two classes of diffusion models that will be explored further in our paper.

Section: MARKOVIAN DIFFUSION MODEL: DDPM
Let $q(x_0)$ be the true data distribution. Denoising diffusion probabilistic models (DDPM) (Ho et al., 2020) construct the following Markovian process:
$$q(x_{0:T}) = q(x_0)\prod_{t=1}^{T} q(x_t|x_{t-1}), \tag{1}$$
where $q(x_t|x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$ and $\beta_{1:T}$ is the pre-defined noise schedule. We can also derive the following skip-step noising distribution in closed form:
$$q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t)I), \tag{2}$$
where $\bar\alpha_t \equiv \prod_{s=1}^{t}(1-\beta_s)$. When $T$ is sufficiently large, the marginal distribution satisfies $q(x_T) \to \mathcal{N}(0, I)$. The generation process utilizes an initial distribution $q(x_T)$ and the denoising distribution $q(x_{t-1}|x_t)$. We assume that, for a large $T$ and the variance-preserving schedule in Equation (1), we can approximate $q(x_T) \approx p(x_T) = \mathcal{N}(0, I)$. The true $q(x_{t-1}|x_t)$ is intractable and is approximated by a variational distribution $p_\theta(x_{t-1}|x_t)$ within a Gaussian family, which defines the reverse denoising process as
$$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}|\mu_{t-1}(x_t;\theta), \Sigma_{t-1}(x_t;\theta)). \tag{3}$$
With Tweedie's Lemma (Efron, 2011; Robbins, 1992), we obtain the following score representation of the mean:
$$\mu_{t-1}(x_t;\theta) = (x_t + \beta_t\nabla_{x_t}\log p_\theta(x_t))/\sqrt{1-\beta_t}, \tag{4}$$
where the approximated score function $\nabla_{x_t}\log p_\theta(x_t) \approx \nabla_{x_t}\log q(x_t)$ can be learned by denoising score matching (DSM) (Vincent, 2011; Song & Ermon, 2019).
For the covariance $\Sigma_{t-1}(x_t;\theta)$, two heuristic choices were proposed in the original DDPM paper: (1) $\Sigma_{t-1}(x_t;\theta) = \beta_t I$, which is equal to the variance of $q(x_t|x_{t-1})$, and (2) $\Sigma_{t-1}(x_t;\theta) = \tilde\beta_t I$, where $\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$ is the variance of $q(x_{t-1}|x_t, x_0)$. Although heuristic, Ho et al. (2020) show that when $T$ is large, both options yield similar generation quality.
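The quantities in Equations (1) and (2) and the two heuristic variances can be computed directly from a noise schedule. The following is a minimal NumPy sketch; the linear schedule endpoints and the helper names `ddpm_schedule` and `q_sample` are illustrative assumptions, not values or names taken from the paper.

```python
import numpy as np

def ddpm_schedule(betas):
    """Return alpha_bar_t and beta_tilde_t for t = 1..T (stored at index t - 1)."""
    betas = np.asarray(betas, dtype=np.float64)
    alpha_bar = np.cumprod(1.0 - betas)                        # \bar{alpha}_t
    alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))   # \bar{alpha}_{t-1}, with \bar{alpha}_0 = 1
    beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas
    return alpha_bar, beta_tilde

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) as in Equation (2); t is 1-indexed."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)

# Assumed linear schedule, chosen only for illustration.
betas = np.linspace(1e-4, 2e-2, 1000)
alpha_bar, beta_tilde = ddpm_schedule(betas)
```

Note that $\tilde\beta_t \le \beta_t$ for every $t$, so the second heuristic always injects no more noise than the first.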

Section: NON-MARKOVIAN DIFFUSION MODEL: DDIM
In addition to the Markovian diffusion process, denoising diffusion implicit models (DDIM) (Song et al., 2020) define only the conditional $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t)I)$ and let
$$q(x_{t-1}|x_t) \approx \int q(x_{t-1}|x_t, x_0)\,p_\theta(x_0|x_t)\,dx_0, \tag{5}$$
where $q(x_{t-1}|x_t, x_0) = \mathcal{N}(\mu_{t-1}, \sigma_{t-1}^2 I)$ with
$$\mu_{t-1} = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_{t-1}^2}\,(x_t - \sqrt{\bar\alpha_t}\,x_0)/\sqrt{1-\bar\alpha_t}. \tag{6}$$
When $\sigma_{t-1}^2 = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$, the diffusion process becomes Markovian and equivalent to DDPM. Specifically, DDIM chooses $\sigma \to 0$, which implicitly defines a non-Markovian diffusion process. In the original paper (Song et al., 2020), $p_\theta(x_0|x_t)$ is heuristically chosen as a delta function $p_\theta(x_0|x_t) = \delta(x_0 - \mu_0(x_t;\theta))$, where
$$\mu_0(x_t;\theta) = (x_t + (1-\bar\alpha_t)\nabla_{x_t}\log p_\theta(x_t))/\sqrt{\bar\alpha_t}. \tag{7}$$
In both DDPM and DDIM, the covariance of $p_\theta(x_{t-1}|x_t)$ or $p_\theta(x_0|x_t)$ is chosen based on heuristics. Nichol & Dhariwal (2021) have shown that the choice of covariance has a large impact when $T$ is small. Therefore, for the purpose of accelerating diffusion sampling, our paper focuses on improving the covariance estimation quality in these cases. We first introduce our method in the next section and then compare it with other methods in Section 4.
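As a small illustration of Equations (6) and (7), the sketch below computes the point estimate of $x_0$ from a score value and the resulting DDIM mean. The function names are hypothetical, and the score is passed in as a plain array rather than produced by a trained network.

```python
import numpy as np

def predict_x0(x_t, score, alpha_bar_t):
    """Equation (7): estimate of x_0 from the score at x_t."""
    return (x_t + (1.0 - alpha_bar_t) * score) / np.sqrt(alpha_bar_t)

def ddim_mean(x_t, x0, alpha_bar_t, alpha_bar_prev, sigma_prev=0.0):
    """Equation (6): mean of q(x_{t-1} | x_t, x_0) for a chosen sigma_{t-1}."""
    # Noise direction implied by the pair (x_t, x_0).
    eps = (x_t - np.sqrt(alpha_bar_t) * x0) / np.sqrt(1.0 - alpha_bar_t)
    scale = np.sqrt(1.0 - alpha_bar_prev - sigma_prev ** 2)
    return np.sqrt(alpha_bar_prev) * x0 + scale * eps
```

With `sigma_prev=0.0` the update is deterministic given `x0`, which is the DDIM setting described above.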

Section: DIFFUSION MODELS WITH OPTIMAL COVARIANCE MATCHING
Recently, Zhang et al. (2024b) introduced the optimal covariance form of the denoising distribution $q(x|\tilde x) \propto q(\tilde x|x)q(x)$ for the Gaussian convolution $q(\tilde x|x) = \mathcal{N}(x, \sigma^2 I)$, which can be seen as a high-dimensional extension of the second-order Tweedie's Lemma (Efron, 2011; Robbins, 1992). We further extend the formula to scaled Gaussian convolutions in the following theorem.
Theorem 1 (Generalized Analytical Covariance Identity). Given a joint distribution $q(\tilde x, x) = q(\tilde x|x)q(x)$ with $q(\tilde x|x) = \mathcal{N}(\alpha x, \sigma^2 I)$, the covariance of the true posterior $q(x|\tilde x) \propto q(x)q(\tilde x|x)$, which is defined as
$$\Sigma(\tilde x) = \mathbb{E}_{q(x|\tilde x)}[x^2] - \mathbb{E}_{q(x|\tilde x)}[x]^2$$
(with the shorthand $x^2 \equiv x x^\top$), has a closed form:
$$\Sigma(\tilde x) = (\sigma^4\nabla^2_{\tilde x}\log q(\tilde x) + \sigma^2 I)/\alpha^2. \tag{8}$$
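Equation (8) can be checked numerically in one dimension. The sketch below uses a discrete prior on a few atoms (an assumption made purely so that the posterior is available in closed form) and compares the directly computed posterior variance with the value obtained from a finite-difference Hessian of $\log q(\tilde x)$. All function names are illustrative.

```python
import numpy as np

def _log_weights(x_tilde, atoms, weights, alpha, sigma):
    # Unnormalized log posterior weights of each atom given x_tilde.
    return -(x_tilde - alpha * atoms) ** 2 / (2.0 * sigma ** 2) + np.log(weights)

def log_marginal(x_tilde, atoms, weights, alpha, sigma):
    """log q(x_tilde) for a discrete prior and q(x_tilde | x) = N(alpha * x, sigma^2)."""
    z = _log_weights(x_tilde, atoms, weights, alpha, sigma)
    m = z.max()
    return m + np.log(np.exp(z - m).sum()) - 0.5 * np.log(2.0 * np.pi * sigma ** 2)

def posterior_variance(x_tilde, atoms, weights, alpha, sigma):
    """Var[x | x_tilde] computed directly from the posterior weights."""
    z = _log_weights(x_tilde, atoms, weights, alpha, sigma)
    p = np.exp(z - z.max())
    p /= p.sum()
    mean = (p * atoms).sum()
    return (p * atoms ** 2).sum() - mean ** 2

def covariance_from_hessian(x_tilde, atoms, weights, alpha, sigma, eps=1e-3):
    """Right-hand side of Equation (8), with the Hessian from central differences."""
    f = lambda y: log_marginal(y, atoms, weights, alpha, sigma)
    hess = (f(x_tilde + eps) - 2.0 * f(x_tilde) + f(x_tilde - eps)) / eps ** 2
    return (sigma ** 4 * hess + sigma ** 2) / alpha ** 2
```

For two symmetric atoms $\pm a$ with equal weights, both routes reduce to $a^2\,\mathrm{sech}^2(\alpha a \tilde x/\sigma^2)$, which shows how strongly the optimal covariance can depend on the state.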
See Appendix A.1 for a proof. This covariance form can also be shown to be optimal under the KL divergence (Bao et al., 2022b; Zhang et al., 2024b). The exact covariance in this case depends only on the score function, which indicates that it can be derived from the score function. In general, the score function already contains all the information of the denoising distribution $q(x|\tilde x)$; see Zhang et al. (2024b) for further discussion. Here we only consider the covariance for simplicity.
In the case of the diffusion model, we use the learned score function as a plug-in approximation in Equation (8). Although the optimal covariance can be directly calculated from the learned score function, doing so requires calculating the Hessian matrix, which is the Jacobian of the score function. This requires $O(D^2)$ storage and $D$ network evaluations (Martens et al., 2012) for each denoising step at time $t$. Zhang et al. (2024b) propose to use the following consistent diagonal approximation (Bekas et al., 2007) to remove the $O(D^2)$ storage requirement:
$$\mathrm{diag}(H(\tilde x)) \approx \frac{1}{M}\sum_{m=1}^{M} v_m \odot H(\tilde x)v_m, \tag{9}$$
where $H(\tilde x) \equiv \nabla^2_{\tilde x}\log q(\tilde x)$, $v_m \sim p(v)$ is a Rademacher random variable (Hutchinson, 1990) with entries $\pm 1$, and $\odot$ denotes the element-wise product. In Table 1, we compare the generation quality of DDPM using different covariance choices. The results demonstrate that generation quality improves significantly when using the Rademacher estimator to estimate the optimal covariance, compared to the heuristic choices of $\beta$ and $\tilde\beta$. However, as shown in Figure 5 in the Appendix, achieving a desirable approximation on the CIFAR10 dataset necessitates $M \ge 100$ Rademacher samples. Each calculation of $v_m \odot H(\tilde x)v_m$ requires a forward pass and a backward propagation, leading to roughly $2M$ network evaluations in total. This significantly slows down generation, making it impractical for diffusion models. Inspired by Nichol & Dhariwal (2021) and Bao et al. (2022a), and also by the amortization technique used in variational inference (Kingma & Welling, 2013; Dayan et al., 1995), we propose to use a network to match the diagonal Hessian, which requires only one network pass to predict the diagonal Hessian at generation time and can be done in parallel with the score/mean predictions at no extra time cost. In the next section, we introduce a novel unbiased objective to learn the diagonal Hessian from the learned score, which improves the covariance estimation accuracy and leads to better generation quality and higher likelihood estimates.
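The Rademacher estimator of Equation (9) only needs Hessian-vector products, never the full matrix. A minimal sketch, assuming the product is supplied as a callable `hvp` (a hypothetical name; in a diffusion model it would come from differentiating the score network):

```python
import numpy as np

def rademacher_diag(hvp, dim, num_samples, rng):
    """Estimate diag(H) as the average of v * (H v) over Rademacher vectors v."""
    estimate = np.zeros(dim)
    for _ in range(num_samples):
        v = rng.integers(0, 2, size=dim) * 2.0 - 1.0  # entries are -1 or +1
        estimate += v * hvp(v)                        # v ⊙ H v
    return estimate / num_samples
```

The off-diagonal entries of $H$ only contribute zero-mean noise, so the estimate is exact for a diagonal $H$ with a single sample and otherwise improves as `num_samples` grows, which is the cost the amortized network is meant to avoid.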
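[Algorithm fragment (DDIM update step): Let $x_{t'} \leftarrow \sqrt{\bar\alpha_{t'}}\,x_0 + \sqrt{1-\bar\alpha_{t'}}\,\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sqrt{1-\bar\alpha_t}}$.]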

Section: UNBIASED OPTIMAL COVARIANCE MATCHING
To train a network $h_\phi(\tilde x)$ to match the Hessian diagonal $\mathrm{diag}(H(\tilde x))$, a straightforward solution is to directly regress Equation (9) for all the noisy data:
$$\min_\phi \mathbb{E}_{q(\tilde x)}\Big\|h_\phi(\tilde x) - \frac{1}{M}\sum_{m=1}^{M} v_m \odot H(\tilde x)v_m\Big\|_2^2, \tag{10}$$
where $v_m \sim p(v)$. Although this objective is consistent when $M \to \infty$, it introduces additional bias when $M$ is small. To avoid the bias, we propose the following unbiased optimal covariance matching (OCM) objective:
$$L_{\mathrm{ocm}}(\phi) = \mathbb{E}_{q(\tilde x)p(v)}\|h_\phi(\tilde x) - v \odot H(\tilde x)v\|_2^2, \tag{11}$$
which does not include an expectation within the non-linear L2 norm. The following theorem shows the validity of the proposed OCM objective.
Theorem 2 (Validity of the OCM objective). The objective in Equation (11) upper bounds the base objective (i.e., Equation (10) with $M \to \infty$). Moreover, it attains its optimum when $h_\phi(\tilde x) = \mathrm{diag}(H(\tilde x))$ for all $\tilde x \sim q(\tilde x)$.
See Appendix A.2 for a proof. The integration over $v$ in Equation (11) can be approximated without bias by Monte Carlo integration given $M$ Rademacher samples $v_m \sim p(v)$. In practice, we found that $M = 1$ works well (see Table 11 in the appendix for the ablation study on varying $M$ values), which also shows the training efficiency of the proposed OCM objective. The learned $h_\phi$ can form the covariance approximation
$$\Sigma(\tilde x;\phi) = (\sigma^4 h_\phi(\tilde x) + \sigma^2 I)/\alpha^2. \tag{12}$$
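To make the objective concrete, the sketch below evaluates the single-sample ($M = 1$) OCM loss of Equation (11) at one point and forms the diagonal of the covariance in Equation (12). The callable `hvp` and both function names are assumptions for illustration; a real implementation would average the loss over a mini-batch and backpropagate into $h_\phi$.

```python
import numpy as np

def ocm_loss(h_pred, hvp, v):
    """Single-sample estimate of Equation (11) at one noisy point.

    h_pred: predicted Hessian diagonal, shape (D,)
    hvp:    callable returning H v for a vector v
    v:      Rademacher vector with entries -1 or +1, shape (D,)
    """
    target = v * hvp(v)  # v ⊙ H v, an unbiased sample of diag(H)
    return np.sum((h_pred - target) ** 2)

def covariance_diag(h_pred, sigma, alpha):
    """Diagonal of Equation (12) for a predicted Hessian diagonal."""
    return (sigma ** 4 * h_pred + sigma ** 2) / alpha ** 2
```

Averaged over $v$, this loss equals $\|h - \mathrm{diag}(H)\|_2^2$ plus a constant, so moving the prediction away from $\mathrm{diag}(H)$ by $\delta$ raises the expected loss by exactly $\|\delta\|_2^2$.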
We then discuss how to apply the learned covariance to diffusion models.

Section: DIFFUSION MODELS APPLICATIONS
Given access to a learned score function, $\nabla_{x_t}\log p_\theta(x_t)$, from any pre-trained diffusion model, we denote the Jacobian of the score as $H_t(x_t)$. Assuming $M = 1$ in the OCM training objective, the covariance learning objective for diffusion models can be expressed as follows:
$$\min_\phi \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{q(x_t, x_0)p(v)}\|h_\phi(x_t) - v \odot H_t(x_t)v\|_2^2, \tag{13}$$
where $v \sim p(v)$ and $h_\phi(x_t)$ is a network that is conditioned on the state $x_t$ and time $t$. After training with this objective, the learned $h_\phi(x_t)$ can be used to form the diagonal Hessian approximation $h_\phi(x_t) \approx \mathrm{diag}(H_t(x_t))$, which in turn forms our approximation of the covariance. We then derive its use cases in skip-step DDPM and DDIM for diffusion acceleration.
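The regression target $v \odot H_t(x_t)v$ in Equation (13) requires one Jacobian-vector product of the score. The sketch below approximates that product with central finite differences, which is a stand-in chosen to keep the example dependency-free; an automatic-differentiation Jacobian-vector product would normally be used instead. `score_fn` and `ocm_target` are hypothetical names.

```python
import numpy as np

def ocm_target(score_fn, x_t, v, eps=1e-3):
    """Return v ⊙ (J v), where J is the Jacobian of score_fn at x_t.

    The product J v is approximated by a central difference along v,
    which costs two score evaluations and never forms J explicitly.
    """
    jv = (score_fn(x_t + eps * v) - score_fn(x_t - eps * v)) / (2.0 * eps)
    return v * jv
```

For example, for Gaussian data with variance 4 the score is $-x/4$, and the target equals $-0.25$ in every coordinate for any Rademacher $v$.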

Skip-Step DDPM: For the general skip-step DDPM, we consider the denoising distribution $q(x_{t'}|x_t)$ with $t' < t$. When $t' = t-1$, this becomes the classic one-step DDPM. We further denote $\bar\alpha_{t':t} = \prod_{s=t'+1}^{t}\alpha_s$ with $\alpha_s \equiv 1-\beta_s$, and thus $\bar\alpha_{0:t} = \bar\alpha_t$ and $\bar\alpha_{t':t} = \bar\alpha_t/\bar\alpha_{t'}$. We can write the forward process as
$$q(x_t|x_{t'}) = \mathcal{N}(x_t|\sqrt{\bar\alpha_{t':t}}\,x_{t'}, (1-\bar\alpha_{t':t})I). \tag{14}$$
The corresponding Gaussian denoising distribution $p_{\theta,\phi}(x_{t'}|x_t) = \mathcal{N}(\mu_{t'}(x_t;\theta), \Sigma_{t'}(x_t;\phi))$ has the following mean and covariance functions:
$$\mu_{t'}(x_t;\theta) = (x_t + (1-\bar\alpha_{t':t})\nabla_{x_t}\log p_\theta(x_t))/\sqrt{\bar\alpha_{t':t}}, \tag{15}$$
$$\Sigma_{t'}(x_t;\phi) = ((1-\bar\alpha_{t':t})^2 h_\phi(x_t) + (1-\bar\alpha_{t':t})I)/\bar\alpha_{t':t}. \tag{16}$$
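A single skip-step DDPM update built from Equations (15) and (16) can be sketched as follows. The score and the predicted Hessian diagonal are passed in as arrays; the clipping of the variance at zero is an added safeguard that is an assumption of this sketch, not something specified in the text.

```python
import numpy as np

def skip_step_ddpm(x_t, score, h_diag, alpha_bar_t, alpha_bar_tp, rng):
    """Sample x_{t'} from N(mu_{t'}, Sigma_{t'}) with a diagonal covariance.

    alpha_bar_t, alpha_bar_tp: cumulative products at times t and t' (t' < t).
    Returns the sample together with the mean and the diagonal variance.
    """
    a = alpha_bar_t / alpha_bar_tp                      # \bar{alpha}_{t':t}
    mean = (x_t + (1.0 - a) * score) / np.sqrt(a)       # Equation (15)
    var = ((1.0 - a) ** 2 * h_diag + (1.0 - a)) / a     # Equation (16), diagonal
    var = np.maximum(var, 0.0)                          # assumed guard against negative predictions
    sample = mean + np.sqrt(var) * rng.standard_normal(x_t.shape)
    return sample, mean, var
```

As a sanity check, for standard Gaussian data under a variance-preserving process the score is $-x_t$ and the Hessian diagonal is $-1$, which gives mean $\sqrt{\bar\alpha_{t':t}}\,x_t$ and variance $1-\bar\alpha_{t':t}$, the exact Gaussian posterior.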
Table 2: Overview of different covariance estimation methods. Methods are ranked by increasing modeling capability from top to bottom. We also include the intuition of the methods and how many additional network passes are required for estimating the covariance.

| Modeling Capability | Covariance Type | +#Passes | Intuition |
|---|---|---|---|
| $x_t$-independent | Isotropic $\beta$ (Ho et al., 2020) | 0 | Cov. of $q(x_t \mid x_{t-1})$ |
| $x_t$-independent | Isotropic $\tilde\beta$ (Ho et al., 2020) | 0 | Cov. of $q(x_{t-1} \mid x_t, x_0)$ |
| $x_t$-independent | Isotropic Estimation (Bao et al., 2022b) | 0 | Estimate from data |
| $x_t$-dependent | Diagonal VLB (Nichol & Dhariwal, 2021) | 1 | Learn from data |
| $x_t$-dependent | Diagonal SN (Bao et al., 2022a) | 1 | Learn from data |
| $x_t$-dependent | Diagonal OCM (Ours) | 1 | Learn from score |
| $x_t$-dependent | Diagonal Estimation (Zhang et al., 2024b) | $2M$ | Estimate from score |
| $x_t$-dependent | Diagonal Analytic (Zhang et al., 2024b) | $D$ | Calculate from score |
| $x_t$-dependent | Full Analytic (Zhang et al., 2024b) | $D$ | Calculate from score |
The skip-step denoising sample $x_{t'} \sim p_{\theta,\phi}(x_{t'}|x_t)$ can be obtained by $x_{t'} = \mu_{t'}(x_t;\theta) + \Sigma^{1/2}_{t'}(x_t;\phi)\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
Skip-Step DDIM: Similarly, we give the skip-step formulation of DDIM, where we set $\sigma_{t-1} = 0$ in Equation (6), as in the original paper (Song et al., 2020). We can use the approximated covariance of $p_{\theta,\phi}(x_0|x_t)$ to replace the delta function used in the vanilla DDIM, which gives the skip-step DDIM sample
$$x_{t'} = \sqrt{\bar\alpha_{t'}}\,x_0 + \sqrt{1-\bar\alpha_{t'}}/\sqrt{1-\bar\alpha_t}\cdot(x_t - \sqrt{\bar\alpha_t}\,x_0), \tag{17}$$
where $x_0 \sim p_{\theta,\phi}(x_0|x_t) = \mathcal{N}(\mu_0(x_t;\theta), \Sigma_0(x_t;\phi))$ and
$$\mu_0(x_t;\theta) = (x_t + (1-\bar\alpha_t)\nabla_{x_t}\log p_\theta(x_t))/\sqrt{\bar\alpha_t}, \tag{18}$$
$$\Sigma_0(x_t;\phi) = ((1-\bar\alpha_t)^2 h_\phi(x_t) + (1-\bar\alpha_t)I)/\bar\alpha_t. \tag{19}$$
The sample $x_0 \sim p_{\theta,\phi}(x_0|x_t)$ can be obtained by $x_0 = \mu_0(x_t;\theta) + \Sigma^{1/2}_0(x_t;\phi)\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. In the next section, we will compare the proposed method to other covariance estimation methods in practical examples.
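The skip-step DDIM update of Equations (17) to (19) can be sketched in the same style: draw $x_0$ from the Gaussian with the learned diagonal covariance, then move deterministically to time $t'$. As before, the inputs are plain arrays, the function name is hypothetical, and the clipping at zero is an assumption of the sketch.

```python
import numpy as np

def skip_step_ddim(x_t, score, h_diag, alpha_bar_t, alpha_bar_tp, rng):
    """One skip-step DDIM update with a Gaussian p(x_0 | x_t)."""
    mu0 = (x_t + (1.0 - alpha_bar_t) * score) / np.sqrt(alpha_bar_t)                 # Equation (18)
    var0 = ((1.0 - alpha_bar_t) ** 2 * h_diag + (1.0 - alpha_bar_t)) / alpha_bar_t   # Equation (19)
    var0 = np.maximum(var0, 0.0)                                                     # assumed guard
    x0 = mu0 + np.sqrt(var0) * rng.standard_normal(x_t.shape)
    # Equation (17): reuse the noise direction implied by (x_t, x0).
    eps = (x_t - np.sqrt(alpha_bar_t) * x0) / np.sqrt(1.0 - alpha_bar_t)
    return np.sqrt(alpha_bar_tp) * x0 + np.sqrt(1.0 - alpha_bar_tp) * eps
```

When the predicted variance is zero the update reduces to vanilla DDIM with the delta function at $\mu_0$, and when $t' = t$ it returns $x_t$ unchanged regardless of the sampled $x_0$.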

Section: DETAILS OF TRAINING AND INFERENCE
Our model comprises two components: a score prediction network $s_\theta$ and a diagonal Hessian prediction network $h_\phi$. In line with Bao et al. (2022a), we parameterize the score prediction network using a pretrained diffusion model, and the Hessian prediction network is parameterized by sharing parameters as follows:
$$s_\theta(x_t) = \mathrm{NN}_1(\mathrm{BaseNet}(x_t, t;\theta_1);\theta_2), \qquad h_\phi(x_t) = \mathrm{NN}_2(\mathrm{BaseNet}(x_t, t;\theta_1);\phi), \tag{20}$$
where BaseNet represents the commonly used architecture in diffusion models, such as UNet and DiT (Ho et al., 2020; Peebles & Xie, 2023). This parameterization requires only an additional small neural network, $\mathrm{NN}_2$, resulting in negligible extra computational and memory costs compared to the original diffusion models (see Appendix B.1 for more details). In our experiments, we fix the parameters $\theta = \{\theta_1, \theta_2\}$ and train the Hessian prediction network exclusively with the proposed OCM objective. After training, samples can be generated using Algorithms 1 and 2.
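The parameter sharing in Equation (20) amounts to one backbone pass followed by two small heads. The toy NumPy model below only illustrates that structure; the layer sizes, the `tanh` feature map, the way `t` is appended to the input, and the class name are all assumptions and bear no relation to the UNet or DiT backbones used in the experiments.

```python
import numpy as np

class TwoHeadModel:
    """Toy stand-in for BaseNet with a score head (NN_1) and a Hessian head (NN_2)."""

    def __init__(self, dim, hidden, rng):
        self.w_base = rng.standard_normal((dim + 1, hidden)) / np.sqrt(dim + 1)  # theta_1 (frozen)
        self.w_score = rng.standard_normal((hidden, dim)) / np.sqrt(hidden)      # theta_2 (frozen)
        self.w_hess = np.zeros((hidden, dim))                                     # phi (trained with OCM)

    def features(self, x_t, t):
        # Append the time step as an extra input column, then apply the shared backbone.
        t_col = np.full((x_t.shape[0], 1), float(t))
        return np.tanh(np.concatenate([x_t, t_col], axis=1) @ self.w_base)

    def forward(self, x_t, t):
        f = self.features(x_t, t)  # a single shared backbone pass
        return f @ self.w_score, f @ self.w_hess
```

Because both outputs come from the same features, predicting the Hessian diagonal adds only the cost of the second head at generation time.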

Section: RELATED COVARIANCE ESTIMATION METHODS
We now discuss different choices of $\Sigma_{t'}(x_t)$ used in the diffusion model literature; see also Table 2 for an overview. For brevity, we mainly focus on DDPM here, in which the mean $\mu_{t'}(x_t)$ can be computed as in Equation (15).
1. $x_t$-independent isotropic covariance: $\beta$/$\tilde\beta$-DDPM (Ho et al., 2020). The $\beta$-DDPM uses the variance of $q(x_t|x_{t'})$, which is $\Sigma_{t'}(x_t) = (1-\bar\alpha_t/\bar\alpha_{t'})I$; when $t' = t-1$, we have $\Sigma_{t-1}(x_t) = \beta_t I$. The $\tilde\beta$-DDPM uses the covariance of $q(x_{t'}|x_0, x_t)$, which is $\frac{1-\bar\alpha_{t'}}{1-\bar\alpha_t}(1-\bar\alpha_{t':t})I$.
2. $x_t$-independent isotropic covariance: A-DDPM (Bao et al., 2022b). A-DDPM assumes a state-independent isotropic covariance $\Sigma_{t'}(x_t) = \sigma_{t'}^2 I$ with the following analytic form:
$$\sigma_{t'}^2 = \frac{1-\bar\alpha_{t':t}}{\bar\alpha_{t':t}} - \frac{(1-\bar\alpha_{t':t})^2}{d\,\bar\alpha_{t':t}}\,\mathbb{E}_{q(x_t)}\|\nabla_{x_t}\log p_\theta(x_t)\|_2^2,$$
where $d$ is the data dimension and the integration over $q(x_t)$ requires a Monte Carlo estimate before conducting generation. This variance choice is optimal under the KL divergence within a constrained isotropic state-independent posterior family.
3. $x_t$-dependent diagonal covariance: I-DDPM (Nichol & Dhariwal, 2021). In I-DDPM, the diagonal covariance matrix is modelled as a log-domain interpolation between $\beta_t$ and $\tilde\beta_t$:
$$\Sigma_{t'}(x_t;\psi) = \exp(v_\psi(x_t)\log\beta_t + (1-v_\psi(x_t))\log\tilde\beta_t), \tag{21}$$
where $v_\psi$ is parametrized via a neural network. The covariance is learned with the variational lower bound (VLB). At the optimum of training, the covariance learned by the VLB will recover the true covariance in Equation (8). Notably, Nichol & Dhariwal (2021) heuristically obtain the covariance of skip sampling $\Sigma_{t'}(x_t;\psi)$ by rescaling $\beta_t$ and $\tilde\beta_t$ accordingly:
$$\beta_t \to 1-\bar\alpha_{t':t}, \qquad \tilde\beta_t \to \frac{1-\bar\alpha_{t'}}{1-\bar\alpha_t}(1-\bar\alpha_{t':t}).$$
However, when $t' = 0$, $\Sigma_0(x_t;\psi)$ is ill-defined; thus, I-DDPM is inapplicable within the DDIM framework.
4. $x_t$-dependent diagonal covariance: SN-DDPM (Bao et al., 2022a). SN-DDPM learns the covariance by training a neural network $g_\psi$ to estimate the second moment of the noise $\epsilon_t = (x_t - \sqrt{\bar\alpha_t}\,x_0)/\sqrt{1-\bar\alpha_t}$:
$$\min_\psi \mathbb{E}_{t, q(x_0, x_t)}\|\epsilon_t^2 - g_\psi(x_t)\|_2^2. \tag{22}$$
After training, the covariance $\Sigma_{t'}(x_t;\psi)$ can be estimated via
$$\Sigma_{t'}(x_t;\psi) = \frac{1-\bar\alpha_{t'}}{1-\bar\alpha_t}\bar\beta_{t':t}I + \frac{\bar\beta_{t':t}^2}{\bar\alpha_{t':t}}\left(\frac{g_\psi(x_t)}{1-\bar\alpha_t} - (\nabla_{x_t}\log p_\theta(x_t))^2\right), \tag{23}$$
where $\bar\beta_{t':t} = 1-\bar\alpha_{t':t}$ and the square is taken element-wise; for $t' = t-1$ the coefficient of the second term reduces to $\beta_t^2/\alpha_t$. At the optimum, $\Sigma_{t'}(x_t;\psi)$ in Equation (23) will recover the true covariance in Equation (8). We demonstrate the equivalence between OCM-DDPM and SN-DDPM in Appendix A.3. However, due to the appearance of the quadratic term in Equation (23), SN-DDPM tends to amplify the estimation error as $t \to 0$, leading to suboptimal solutions. To mitigate this issue, Bao et al. (2022a) also propose NPR-DDPM, which models the noise prediction residual instead. We refer to their paper for detailed explanations.
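The claimed agreement between the SN-DDPM covariance and the OCM covariance at the optimum can be checked numerically. The sketch below assumes the skip-step form of Equation (23) with coefficient $\bar\beta_{t':t}^2/\bar\alpha_{t':t}$ on the second term, and assumes the optimal second-moment network satisfies $g(x_t) = 1 + (1-\bar\alpha_t)\,(\mathrm{diag}(H_t(x_t)) + (\nabla_{x_t}\log q(x_t))^2)$, which follows from the second-order Tweedie formula. Function names are illustrative.

```python
import numpy as np

def ocm_cov(h_diag, alpha_bar_t, alpha_bar_tp):
    """Diagonal of Equation (16) from the Hessian diagonal."""
    a = alpha_bar_t / alpha_bar_tp
    return ((1.0 - a) ** 2 * h_diag + (1.0 - a)) / a

def sn_cov(g, score, alpha_bar_t, alpha_bar_tp):
    """Diagonal of Equation (23) from the noise second moment g and the score."""
    a = alpha_bar_t / alpha_bar_tp
    b = 1.0 - a
    iso = (1.0 - alpha_bar_tp) / (1.0 - alpha_bar_t) * b
    return iso + b ** 2 / a * (g / (1.0 - alpha_bar_t) - score ** 2)
```

The two expressions coincide for the exact `g`, but `sn_cov` subtracts two quantities that are each divided by small factors as $t \to 0$, which is one way to see why errors in `g` are amplified there.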
Notably, almost all of these methods can be applied within the DDIM framework by setting $t' = 0$. Specifically, $p(x_{t'}|x_t)$ can be sampled using Equation (17) with $x_0 \sim \mathcal{N}(\mu_0(x_t), \Sigma_0(x_t))$, where $\mu_0(x_t)$ is the same as in Equation (18), but $\Sigma_0(x_t)$ differs for the various methods as discussed previously.
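For comparison with the learned Hessian diagonal, the two simpler variance choices from items 2 and 3 of the list above can be written in a few lines. This is a sketch with hypothetical function names; `scores` stands for score values evaluated on samples from $q(x_t)$, and `v` for the interpolation output of the network in Equation (21).

```python
import numpy as np

def a_ddpm_variance(scores, a):
    """A-DDPM isotropic variance; a is \bar{alpha}_{t':t}, scores has shape (N, d)."""
    d = scores.shape[1]
    mean_sq_norm = np.mean(np.sum(scores ** 2, axis=1))  # Monte Carlo estimate of E||score||^2
    return (1.0 - a) / a - (1.0 - a) ** 2 / (d * a) * mean_sq_norm

def i_ddpm_variance(v, beta_t, beta_tilde_t):
    """I-DDPM log-domain interpolation between beta_t and beta_tilde_t (Equation (21))."""
    return np.exp(v * np.log(beta_t) + (1.0 - v) * np.log(beta_tilde_t))
```

The interpolation returns $\beta_t$ at `v = 1` and $\tilde\beta_t$ at `v = 0`; since $\log\tilde\beta_t$ diverges when $\tilde\beta_t = 0$, the expression is undefined at $t' = 0$, consistent with the remark on DDIM above.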

Section: EXPERIMENTAL RESULTS
To support our theoretical discussion, we first evaluate the performance of optimal covariance matching by training diffusion probabilistic models on 2D toy examples. We then demonstrate its effectiveness in enhancing image modelling in both pixel and latent spaces, focusing on the comparison between optimal covariance matching and other covariance estimation methods, and showing that the proposed approach has the potential to scale to large image generation tasks.

Section: TOY DEMONSTRATION
We first demonstrate the effectiveness of our method by considering the data distribution as a two-dimensional mixture of forty Gaussians (MoG) with means uniformly distributed over $[-40, 40] \otimes [-40, 40]$ and a standard deviation of $\sigma = \sqrt{40}$, where $\otimes$ denotes the Cartesian product (see Figure 1a for a visualization). In this case, both the true score $\nabla_{x_t}\log p(x_t)$ and the optimal covariance $\Sigma_{t'}(x_t)$ are available, allowing us to compare the covariance estimation error. We then learn the covariance from the true scores using different methods and conduct DDPM and DDIM sampling for this MoG problem. For evaluation, we employ the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), which utilizes five kernels with bandwidths $\{2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}\}$. In Figures 4c and 4d, we show the MMD comparison among different methods. Specifically, we choose to conduct different diffusion steps with the skip-step scheme as discussed in Section 3.2. [...] 1c, demonstrate that the proposed method outperforms other covariance learning approaches. Additionally, we include two methods utilizing the true diagonal and full covariance, serving as benchmarks for the best achievable performance. Notably, in this case, using the full covariance yields better performance compared to the diagonal covariance. This observation highlights the importance of accurate covariance approximation, suggesting that improved methods for learning the off-diagonal terms could lead to even better results. For further analysis, we present an additional toy experiment in Appendix C.1 to demonstrate the efficacy of our approach.
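The MMD metric used above can be sketched as follows. The text only specifies five kernels with the listed bandwidths; the Gaussian (RBF) form of the kernel, the sum over bandwidths, and the biased V-statistic estimator are assumptions of this sketch.

```python
import numpy as np

def mmd2(x, y, bandwidths=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Biased estimate of squared MMD between sample sets x (N, d) and y (M, d)."""
    def kernel(a, b):
        # Pairwise squared distances, then a sum of Gaussian kernels.
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return sum(np.exp(-d2 / (2.0 * bw ** 2)) for bw in bandwidths)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()
```

A value near zero indicates that the generated samples and the reference samples are hard to distinguish under these kernels; larger values indicate a mismatch.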

Section: IMAGE MODELLING WITH DIFFUSION MODELS
Following the experimental setting in Bao et al. (2022a), we then evaluate our method across varying pre-trained score networks provided by previous works (Ho et al., 2020;Nichol & Dhariwal, 2021;Song et al., 2020;Bao et al., 2022b;Peebles & Xie, 2023). In this experiment, we mainly focus on four datasets: CIFAR10 (Krizhevsky et al., 2009) with the linear schedule (LS) of β t (Ho et al., 2020) and the cosine schedule (CS) of β t (Nichol & Dhariwal, 2021); CelebA (Liu et al., 2015); LSUN Bedroom (Yu et al., 2015). The details of the experimental setting can be found in Appendix B.

Section: IMPROVING LIKELIHOOD RESULTS
We first evaluate the negative log-likelihood (NLL) of the diffusion models by calculating the negative evidence lower bound (ELBO), $-\log p_\theta(x_0) \le -\mathbb{E}_q\left[\log\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] \equiv -L_{\mathrm{elbo}}(x_0)$, with the same mean yet varying covariances. As per Bao et al. (2022b), we report results only for the DDPM forward process, as $-L_{\mathrm{elbo}} = \infty$ under the DDIM forward process. As shown in Table 3, our OCM-DDPM demonstrates the best or second-best performance in most scenarios, highlighting the effectiveness of our covariance estimation approach. We also note that SN-DDPM performs poorly on likelihood results in small-scale datasets like CIFAR10 and CelebA, likely due to the amplified error of the quadratic term in Equation (22). Although NPR-DDPM achieves slightly better performance in certain scenarios, it falls short in terms of sample quality, as measured by FID, shown in Table 5. In Appendix C.2, we also compare our method to Improved DDPM (Nichol & Dhariwal, 2021) in terms of likelihood on ImageNet 64x64. The results show that our method achieves better likelihood estimation with a small number of timesteps, while achieving comparable results with full timesteps.
Notably, while SN-DDPM often demonstrates slightly better FID performance in certain cases, it struggles with likelihood estimation, as shown in Table 3. Conversely, as discussed in the previous section, NPR-DDPM sometimes achieves superior likelihood results in Table 3, but its image generation quality is significantly worse than that of OCM-DDPM. Therefore, the proposed OCM-DDPM provides a more balanced model, offering a better trade-off between generation quality and likelihood estimation. To better highlight this benefit, we include a visualization in Figure 2. We generate samples using 10 timesteps with varying CFG coefficients (see Table 12 for exact numerical values).
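The covariance enters the likelihood evaluation through the per-step terms of the negative ELBO, each of which is a KL divergence between two Gaussians with diagonal covariance (the forward posterior $q(x_{t'}|x_t, x_0)$ and the model $p_{\theta,\phi}(x_{t'}|x_t)$). A minimal sketch of that term and of the usual conversion to bits per dimension, with hypothetical function names:

```python
import numpy as np

def gaussian_kl_diag(mean_q, var_q, mean_p, var_p):
    """KL( N(mean_q, diag(var_q)) || N(mean_p, diag(var_p)) ) in nats."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mean_q - mean_p) ** 2) / var_p - 1.0
    )

def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * np.log(2.0))
```

With the mean held fixed, the KL term is minimized coordinate-wise when `var_p` equals `var_q` plus the squared mean error, which is why a better covariance estimate directly tightens the bound.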

Section: IMAGE MODELLING WITH LATENT DIFFUSION MODELS
In this section, we apply our methods to latent diffusion models (Vahdat et al., 2021; Rombach et al., 2022) on ImageNet 256×256 to demonstrate the scalability of our method. Specifically, we compare our methods to other approaches within the DiT architecture (Peebles & Xie, 2023), evaluating sample quality using the FID score and diversity using the Recall score (Sajjadi et al., 2018). We focus on conditional generation with classifier-free guidance (CFG), which is consistent with Peebles & Xie (2023).
It is important to note that DiT was initially trained using the Improved DDPM (I-DDPM) algorithm (Nichol & Dhariwal, 2021). Therefore, the results of I-DDPM reflect the original performance of the pre-trained DiT model provided by Peebles & Xie (2023). For DDPM and DDIM sampling, we discard the learned covariance in DiT and use only the learned noise prediction neural network for the posterior mean. In contrast, our OCM-DPM retains the same mean function while learning the covariance using the optimal covariance matching objective. We then generate samples with 10 sampling steps and report FID and Recall scores across different CFG coefficients on the ImageNet 256x256 dataset. The results are displayed in Figure 3, with DDPM-$\beta$ excluded due to its poor performance. They show that our OCM-DPM method achieves the best FID performance with CFG = 2.0. Although DDPM-$\tilde\beta$ and DDIM perform better in terms of FID at CFG = 1.5, their Recall scores [...]
To provide an overview of the superiority of our method, we compare the minimum number of sampling steps required to achieve an FID score (Heusel et al., 2017) close to 6 across different approaches (see Appendix B.1 for details). As shown in Table 4, our method requires the fewest denoising steps in most settings. Moreover, we emphasize that compared to recent advanced methods like SN-DPM and NPR-DPM, our approach is the only one that consistently delivers both competitive likelihood and FID.

Section: RELATED WORK AND FUTURE DIRECTIONS
In this paper, we show that improving diagonal covariance estimation can significantly enhance the performance of diffusion models. This raises a natural question: can more flexible covariance structures further improve these models? In Figure 1, we demonstrate that a full covariance structure achieves better generation quality with fewer NFEs in a toy 2D problem. However, for high-dimensional problems, the quadratic growth in parameter scale makes a full covariance approach computationally impractical. To address this, low-rank or block-diagonal covariance approximations offer promising alternatives. Developing effective training objectives for flexible covariance structures remains a compelling direction for future research.
In addition to the covariance estimation methods discussed in Section 4, there are other approaches to accelerate sampling in diffusion models. One approach involves using faster numerical solvers for differential equations with continuous timesteps (Jolicoeur-Martineau et al., 2021;Liu et al., 2022;Lu et al., 2022). Another strategy, inspired by Schrödinger bridge (Wang et al., 2021;De Bortoli et al., 2021), is to introduce a nonlinear, trainable forward diffusion process. Additionally, replacing Gaussian modelling of p θ (x 0 |x t ) with more expressive alternatives, such as GANs (Xiao et al., 2021), distributional models (Bortoli et al., 2025), latent variable models (Yu et al., 2024), or energy-based models (Xu et al., 2024), can also accelerate the sampling of diffusion models.
Recently, distillation techniques have gained popularity, achieving state-of-the-art in one-step generation (Zhou et al., 2024). There are two prominent types of distillation methods. The first is trajectory distillation (Salimans & Ho, 2022;Berthelot et al., 2023;Song et al., 2023;Heek et al., 2024;Kim et al., 2023;Li & He, 2024), which focuses on accelerating the process of solving differential equations. The second type involves distillation techniques that utilize a one-step implicit latent variable model as the student model (Luo et al., 2024;Salimans et al., 2024;Xie et al., 2024;Zhou et al., 2024;Zhang et al., 2025) and distills the diffusion process into the student model by minimizing the spread divergence family (Zhang et al., 2020;2019) through score estimation (Poole et al., 2022;Wang et al., 2024).
Although these distillation methods typically offer faster generation speeds (fewer NFEs) compared to covariance estimation methods, they often lack tractable density or likelihood estimation. This presents a challenge for generative modeling applications where likelihood or density estimation is  (Zhang et al., 2024a;Vargas et al., 2023;Chen et al., 2024;Akhound-Sadegh et al., 2024). Another example is diffusion model-based data compression, where better likelihood corresponds to better compression rates (Townsend et al., 2019;Zhang et al., 2021;Ho et al., 2020;Kingma et al., 2021). In these cases, the proposed OCM-DDPM can be used to provide better likelihood estimates, resulting in improved task performance. Additionally, a recent paper (Salimans et al., 2024) shows how moment-matching improves distillation, achieving state-of-the-art diffusion quality with first-order matching. Our method could extend this to higher-order moment matching, potentially accelerating distillation training. We leave it as a promising direction for future work.
Beyond image modelling, our method can be straightforwardly applied to accelerate large-scale video diffusion models (Blattmann et al., 2023;Chen et al., 2023), which are based on latent diffusion models. A recent study (Zhao et al., 2024) shows that covariance estimation is crucial in mitigating the image-leakage issue in image-to-video generation problems, presenting another promising application of our method. Moreover, our method can be applied to solve inverse problems, where the optimal covariance can enhance the accuracy of posterior sampling (Chung et al., 2022;Rozet et al., 2024). While this work focuses on the DDPM and DDIM sampler, incorporating the improved covariance into the general stochastic and deterministic differential solver (Song et al., 2021;Karras et al., 2022) also represents exciting directions for future research.

Section: CONCLUSION
In this paper, we proposed a new method for learning the diagonal covariance of the denoising distribution, which offers improved accuracy compared to other covariance learning approaches.
Our results demonstrate that enhanced covariance estimation leads to better generation quality and diversity, all while reducing the number of required generation steps. We validated the scalability of our method across different problem domains and scales and discussed its connection to various acceleration techniques and highlighted several promising application areas for future exploration.
Σ(x) = σ 4 ∇ 2 x log q(x) + σ 2 I /α 2 . (8
)
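Proof. This proof generalizes the original analytical covariance identity proof discussed in Zhang et al. (2024b). Using the fact that $\nabla_{\tilde{x}} q(\tilde{x}|x) = -\frac{1}{\sigma^2}(\tilde{x} - \alpha x)\, q(\tilde{x}|x)$ and Tweedie's formula $\nabla_{\tilde{x}} \log q(\tilde{x}) = \frac{1}{\sigma^2}\left(\alpha\, \mathbb{E}_{q(x|\tilde{x})}[x] - \tilde{x}\right)$, we can expand the Hessian of $\log q(\tilde{x})$:
$$
\begin{aligned}
\nabla_{\tilde{x}}^2 \log q(\tilde{x})
&= -\frac{1}{\sigma^2} \nabla_{\tilde{x}} \int (\tilde{x} - \alpha x)\, \frac{q(\tilde{x}|x)\, q(x)}{q(\tilde{x})}\, dx \\
&= -\frac{1}{\sigma^2} \int \frac{q(\tilde{x}|x)\, q(x)}{q(\tilde{x})}\, dx \; I
 - \frac{1}{\sigma^2} \int (\tilde{x} - \alpha x)\, \frac{\left(\nabla_{\tilde{x}} q(\tilde{x}|x)\, q(x)\, q(\tilde{x}) - \nabla_{\tilde{x}} q(\tilde{x})\, q(\tilde{x}|x)\, q(x)\right)^\top}{q^2(\tilde{x})}\, dx \\
\Longrightarrow\; \sigma^2 \nabla_{\tilde{x}}^2 \log q(\tilde{x}) + I
&= -\int (\tilde{x} - \alpha x)\, \frac{\left(\nabla_{\tilde{x}} q(\tilde{x}|x)\, q(x) - \nabla_{\tilde{x}} \log q(\tilde{x})\, q(\tilde{x}|x)\, q(x)\right)^\top}{q(\tilde{x})}\, dx \\
&= -\int (\tilde{x} - \alpha x)\, \frac{\left(-\frac{1}{\sigma^2}(\tilde{x} - \alpha x)\, q(\tilde{x}|x)\, q(x) + \frac{1}{\sigma^2}\left(\tilde{x} - \alpha\, \mathbb{E}_{q(x|\tilde{x})}[x]\right) q(\tilde{x}|x)\, q(x)\right)^\top}{q(\tilde{x})}\, dx \\
\Longrightarrow\; \sigma^4 \nabla_{\tilde{x}}^2 \log q(\tilde{x}) + \sigma^2 I
&= \int \left[(\tilde{x} - \alpha x)(\tilde{x} - \alpha x)^\top - (\tilde{x} - \alpha x)\left(\tilde{x} - \alpha\, \mathbb{E}_{q(x|\tilde{x})}[x]\right)^\top\right] q(x|\tilde{x})\, dx \\
&= \alpha^2\, \mathbb{E}_{q(x|\tilde{x})}[x x^\top] - \alpha^2\, \mathbb{E}_{q(x|\tilde{x})}[x]\, \mathbb{E}_{q(x|\tilde{x})}[x]^\top \equiv \alpha^2 \Sigma(\tilde{x}).
\end{aligned}
$$
Here the first integral in the second line equals one because $q(\tilde{x}|x)\, q(x) / q(\tilde{x}) = q(x|\tilde{x})$ is a normalized density. Therefore, we obtain the analytical full covariance identity: $\Sigma(\tilde{x}) = \left(\sigma^4 \nabla_{\tilde{x}}^2 \log q(\tilde{x}) + \sigma^2 I\right) / \alpha^2$.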
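As a sanity check of the identity, the following minimal sketch compares it against Gaussian conjugacy in one dimension. It assumes a Gaussian prior q(x) = N(m, prior_var), for which both the posterior variance and the Hessian of the log-marginal are available in closed form; the function names and numerical values are illustrative and not taken from the paper.

```python
def posterior_var_direct(prior_var, alpha, sigma2):
    """Posterior variance of x given x_tilde, from Gaussian conjugacy."""
    return 1.0 / (1.0 / prior_var + alpha**2 / sigma2)


def posterior_var_from_hessian(prior_var, alpha, sigma2):
    """Posterior variance from the identity (sigma^4 * H + sigma^2) / alpha^2."""
    marginal_var = alpha**2 * prior_var + sigma2   # variance of q(x_tilde)
    hessian = -1.0 / marginal_var                  # second derivative of log q(x_tilde)
    return (sigma2**2 * hessian + sigma2) / alpha**2


# Illustrative values (assumptions): prior variance 2, alpha = 0.8, sigma^2 = 0.25.
direct = posterior_var_direct(prior_var=2.0, alpha=0.8, sigma2=0.25)
via_identity = posterior_var_from_hessian(prior_var=2.0, alpha=0.8, sigma2=0.25)
print(direct, via_identity)
```

For a Gaussian prior the Hessian does not depend on the observation, so the two routes should agree up to floating-point error for any choice of the three parameters.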

Section: A.2 VALIDITY OF THE OCM OBJECTIVE
Theorem 2 (Validity of the OCM objective). The objective in Equation (11) upper bounds the base objective (i.e., Equation (10) with M → ∞). Moreover, it attains its optimum when $h_\phi(x) = \mathrm{diag}(H(x))$ for all $x \sim q(x)$.
Proof. Recall that to learn the optimal covariance, we can minimize the following loss function
$$\min_\phi\; \mathbb{E}_{q(x)} \left\| h_\phi(x) - \mathbb{E}_{p(v)}\left[v \odot H(x) v\right] \right\|_2^2. \tag{24}$$
We call this the grounded objective because it remains unbiased and consistent, i.e., $h_{\phi^*}(x) = \mathrm{diag}(H(x))$, when the inner expectation is approximated with infinitely many Monte Carlo samples. Using Jensen's inequality, we can show that the OCM objective defined in (11) provides an upper bound for the grounded objective:
$$\mathbb{E}_{q(x)} \left\| h_\phi(x) - \mathbb{E}_{p(v)}\left[v \odot H(x) v\right] \right\|_2^2 = \mathbb{E}_{q(x)} \left\| \mathbb{E}_{p(v)}\left[h_\phi(x) - v \odot H(x) v\right] \right\|_2^2 \le \mathbb{E}_{q(x)} \mathbb{E}_{p(v)} \left\| h_\phi(x) - v \odot H(x) v \right\|_2^2 = L_{\mathrm{ocm}}(\phi).$$
Thus, minimizing the OCM objective minimizes an upper bound on the grounded objective, leading to a more accurate approximation of the diagonal Hessian. We then show that the two objectives share the same optimum. To see this, we can expand the OCM objective
$$
\begin{aligned}
L_{\mathrm{ocm}}(\phi) &= \mathbb{E}_{p(v) q(x)} \left\| h_\phi(x) - v \odot H(x) v \right\|_2^2 \\
&= \mathbb{E}_{q(x)} \left\| h_\phi(x) \right\|_2^2 - 2\, \mathbb{E}_{p(v) q(x)} \left[ h_\phi(x)^\top (v \odot H(x) v) \right] + c \\
&= \mathbb{E}_{q(x)} \left\| h_\phi(x) \right\|_2^2 - 2\, \mathbb{E}_{q(x)} \left[ h_\phi(x)^\top \mathrm{diag}(H(x)) \right] + c \\
&= \mathbb{E}_{q(x)} \left\| h_\phi(x) - \mathrm{diag}(H(x)) \right\|_2^2 + c',
\end{aligned}
$$
where $c$ and $c'$ are constants w.r.t. the parameter $\phi$, and line 2 to line 3 follows from the fact that $\mathbb{E}_{p(v)}[v \odot H(x) v] = \mathrm{diag}(H(x))$. Therefore, $L_{\mathrm{ocm}}(\phi)$ attains its optimum when $h_\phi(x) = \mathrm{diag}(H(x))$ for all $x \sim q(x)$.
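The two facts used in this proof can be checked exactly on a small example. The sketch below is an illustration rather than the paper's implementation: it fixes a single point x with a hypothetical symmetric 3x3 matrix H standing in for H(x), enumerates all eight Rademacher vectors so that the expectation over p(v) is exact, and compares the OCM loss with the squared distance to diag(H). All names are assumptions made for the example.

```python
from itertools import product


def rademacher_vectors(d):
    """All 2^d vectors with entries in {-1, +1}; each has probability 2^-d."""
    return [list(v) for v in product((-1.0, 1.0), repeat=d)]


def v_hadamard_hv(H, v):
    """Elementwise product v * (H v)."""
    d = len(v)
    return [v[i] * sum(H[i][j] * v[j] for j in range(d)) for i in range(d)]


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def ocm_loss(h, H):
    """Exact E_v || h - v * (H v) ||^2 for a single point x."""
    vs = rademacher_vectors(len(h))
    return sum(squared_distance(h, v_hadamard_hv(H, v)) for v in vs) / len(vs)


def distance_to_diagonal(h, H):
    """|| h - diag(H) ||^2, the phi-dependent part of the expanded objective."""
    return squared_distance(h, [H[i][i] for i in range(len(h))])


# Hypothetical symmetric matrix playing the role of H(x).
H = [[2.0, 0.5, -1.0],
     [0.5, 1.0, 0.3],
     [-1.0, 0.3, 3.0]]

vs = rademacher_vectors(3)
mean_estimate = [sum(v_hadamard_hv(H, v)[i] for v in vs) / len(vs) for i in range(3)]

# The gap between the two losses should not depend on the prediction h.
gap_at_zero = ocm_loss([0.0, 0.0, 0.0], H) - distance_to_diagonal([0.0, 0.0, 0.0], H)
gap_at_diag = ocm_loss([2.0, 1.0, 3.0], H) - distance_to_diagonal([2.0, 1.0, 3.0], H)
print(mean_estimate, gap_at_zero, gap_at_diag)
```

Because the gap is the same for every h, the two losses differ only by a constant and share the minimizer h = diag(H), which is the content of the last step of the proof.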

Section: A.3 CONNECTION TO SN-DDPM
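In this section, we show the connection between OCM-DDPM and SN-DDPM through the second-order Tweedie's formula. Before delving into it, we present the following lemmas, which are essential for our derivation.
Lemma 1 (First-order Tweedie's formula (Efron, 2011)). Let $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}\,|\,\sqrt{\alpha} x, \beta I)$. Then the mean of the inverse density $q(x|\tilde{x})$ equals
$$\mathbb{E}_{q(x|\tilde{x})}[x] = \frac{1}{\sqrt{\alpha}} \left(\tilde{x} + \beta \nabla_{\tilde{x}} \log q(\tilde{x})\right). \tag{25}$$
Lemma 2 (Second-order Tweedie's formula). Let $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}\,|\,\sqrt{\alpha} x, \beta I)$. Then the second moment of the inverse density $q(x|\tilde{x})$ equals
$$\mathbb{E}_{q(x|\tilde{x})}[x x^\top] = \frac{1}{\alpha} \left(\tilde{x}\tilde{x}^\top + \beta s_1(\tilde{x}) \tilde{x}^\top + \beta \tilde{x} s_1(\tilde{x})^\top + \beta^2 s_2(\tilde{x}) + \beta^2 s_1(\tilde{x}) s_1(\tilde{x})^\top + \beta I\right), \tag{26}$$
where $s_1(\tilde{x}) \equiv \nabla_{\tilde{x}} \log q(\tilde{x})$ and $s_2(\tilde{x}) \equiv \nabla_{\tilde{x}}^2 \log q(\tilde{x})$.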
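Proof. The proof follows that in Meng et al. (2021, Appendix B), generalized to scaled Gaussian convolutions. Specifically, we first reparameterize $q(\tilde{x}|x)$ as an exponential-family distribution
$$q(\tilde{x}|\eta) = e^{\eta^\top \tilde{x} - \psi(\eta)}\, q_0(\tilde{x}), \tag{27}$$
where $\eta = \frac{\sqrt{\alpha}}{\beta} x$, $q_0(\tilde{x}) \propto e^{-\frac{1}{2\beta} \tilde{x}^\top \tilde{x}}$, and $\psi(\eta)$ denotes the log-partition function. By applying Bayes' rule $q(\eta|\tilde{x}) = \frac{q(\tilde{x}|\eta)\, p(\eta)}{q(\tilde{x})}$, we have the corresponding posterior
$$q(\eta|\tilde{x}) = e^{\eta^\top \tilde{x} - \psi(\eta) - \lambda(\tilde{x})}\, p(\eta), \tag{28}$$
where $\lambda(\tilde{x}) \equiv \log q(\tilde{x}) - \log q_0(\tilde{x})$. Since $\int q(\eta|\tilde{x})\, d\eta = 1$, by taking the derivative w.r.t. $\tilde{x}$ on both sides, we have
$$\int \left(\eta - \nabla_{\tilde{x}} \lambda(\tilde{x})\right)^\top q(\eta|\tilde{x})\, d\eta = 0, \tag{29}$$
which implies that $\mathbb{E}[\eta|\tilde{x}] = \nabla_{\tilde{x}} \lambda(\tilde{x})$. Taking the derivative w.r.t. $\tilde{x}$ on both sides again, we have
$$\int \eta \left(\eta - \nabla_{\tilde{x}} \lambda(\tilde{x})\right)^\top q(\eta|\tilde{x})\, d\eta = \nabla_{\tilde{x}}^2 \lambda(\tilde{x}), \tag{30}$$
which implies that $\mathbb{E}[\eta \eta^\top|\tilde{x}] = \nabla_{\tilde{x}}^2 \lambda(\tilde{x}) + \nabla_{\tilde{x}} \lambda(\tilde{x}) \nabla_{\tilde{x}} \lambda(\tilde{x})^\top$. By substituting $\eta = \frac{\sqrt{\alpha}}{\beta} x$, $\nabla_{\tilde{x}} \lambda(\tilde{x}) = s_1(\tilde{x}) + \frac{1}{\beta} \tilde{x}$, and $\nabla_{\tilde{x}}^2 \lambda(\tilde{x}) = s_2(\tilde{x}) + \frac{1}{\beta} I$, we get the desired result.
Lemma 3 (Converting the covariance of $q(x|\tilde{x})$ to the Hessian of $\log q(\tilde{x})$). Let $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}\,|\,\sqrt{\alpha} x, \beta I)$. Then the covariance of the inverse density $q(x|\tilde{x})$ equals
$$\mathrm{Cov}_{q(x|\tilde{x})}[x] = \frac{\beta}{\alpha} \left(I + \beta \nabla_{\tilde{x}}^2 \log q(\tilde{x})\right). \tag{31}$$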
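Proof. Let $s_1(\tilde{x}) \equiv \nabla_{\tilde{x}} \log q(\tilde{x})$ and $s_2(\tilde{x}) \equiv \nabla_{\tilde{x}}^2 \log q(\tilde{x})$. We have
$$
\begin{aligned}
\mathrm{Cov}_{q(x|\tilde{x})}[x]
&= \frac{\beta^2}{\alpha}\, \mathrm{Cov}_{q(x|\tilde{x})}\left[\frac{\tilde{x} - \sqrt{\alpha} x}{\beta}\right] \\
&= \frac{\beta^2}{\alpha} \left( \mathbb{E}_{q(x|\tilde{x})}\left[\frac{\tilde{x} - \sqrt{\alpha} x}{\beta} \left(\frac{\tilde{x} - \sqrt{\alpha} x}{\beta}\right)^\top\right] - \mathbb{E}_{q(x|\tilde{x})}\left[\frac{\tilde{x} - \sqrt{\alpha} x}{\beta}\right] \mathbb{E}_{q(x|\tilde{x})}\left[\frac{\tilde{x} - \sqrt{\alpha} x}{\beta}\right]^\top \right) \\
&= \frac{\beta^2}{\alpha} \left( \frac{1}{\beta^2}\, \mathbb{E}_{q(x|\tilde{x})}\left[(\tilde{x} - \sqrt{\alpha} x)(\tilde{x} - \sqrt{\alpha} x)^\top\right] - s_1(\tilde{x}) s_1(\tilde{x})^\top \right) \\
&= \frac{\beta^2}{\alpha} \left( \frac{1}{\beta^2} \left( \tilde{x}\tilde{x}^\top - \tilde{x}\left(\tilde{x} + \beta s_1(\tilde{x})\right)^\top - \left(\tilde{x} + \beta s_1(\tilde{x})\right)\tilde{x}^\top + \alpha\, \mathbb{E}_{q(x|\tilde{x})}[x x^\top] \right) - s_1(\tilde{x}) s_1(\tilde{x})^\top \right) \\
&= \frac{\beta^2}{\alpha} \left( \frac{1}{\beta^2} \left( \beta^2 s_2(\tilde{x}) + \beta^2 s_1(\tilde{x}) s_1(\tilde{x})^\top + \beta I \right) - s_1(\tilde{x}) s_1(\tilde{x})^\top \right) \\
&= \frac{\beta^2}{\alpha} \left( \frac{1}{\beta} I + s_2(\tilde{x}) \right) \equiv \frac{\beta}{\alpha} \left( I + \beta \nabla_{\tilde{x}}^2 \log q(\tilde{x}) \right),
\end{aligned}
$$
where line 2 to line 3 follows Lemma 1 and line 4 to line 5 follows Lemma 2.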
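Lemmas 1 to 3 can be verified numerically in a case where the posterior is known exactly. The sketch below assumes a one-dimensional Gaussian prior x ~ N(m, prior_var), for which q(x | x_tilde) follows from conjugacy and the score and Hessian of the marginal are analytic; all function names and values are illustrative assumptions.

```python
import math


def conjugate_posterior(m, prior_var, alpha, beta, x_tilde):
    """Exact mean and variance of q(x | x_tilde) for x ~ N(m, prior_var),
    x_tilde | x ~ N(sqrt(alpha) * x, beta)."""
    var = 1.0 / (1.0 / prior_var + alpha / beta)
    mean = var * (m / prior_var + math.sqrt(alpha) * x_tilde / beta)
    return mean, var


def marginal_score_and_hessian(m, prior_var, alpha, beta, x_tilde):
    """s_1 and s_2 of the marginal q(x_tilde) = N(sqrt(alpha) * m, alpha * prior_var + beta)."""
    marginal_var = alpha * prior_var + beta
    s1 = -(x_tilde - math.sqrt(alpha) * m) / marginal_var
    s2 = -1.0 / marginal_var
    return s1, s2


def tweedie_moments(alpha, beta, x_tilde, s1, s2):
    """Posterior mean (Lemma 1), second moment (Lemma 2), and variance (Lemma 3)."""
    mean = (x_tilde + beta * s1) / math.sqrt(alpha)
    second = (x_tilde**2 + 2.0 * beta * s1 * x_tilde
              + beta**2 * s2 + beta**2 * s1**2 + beta) / alpha
    var = beta / alpha * (1.0 + beta * s2)
    return mean, second, var


# Illustrative values (assumptions).
m, prior_var, alpha, beta, x_tilde = 1.0, 2.0, 0.6, 0.4, 0.7
exact_mean, exact_var = conjugate_posterior(m, prior_var, alpha, beta, x_tilde)
s1, s2 = marginal_score_and_hessian(m, prior_var, alpha, beta, x_tilde)
tw_mean, tw_second, tw_var = tweedie_moments(alpha, beta, x_tilde, s1, s2)
print(exact_mean, tw_mean, exact_var, tw_var)
```

In this Gaussian case the second moment from Lemma 2 minus the squared mean from Lemma 1 reproduces the variance from Lemma 3, mirroring the structure of the proof above.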
It is noteworthy that line 3 in the proof of Lemma 3 also shows the connection between the covariance of $q(x|\tilde{x})$ and the score of $q(\tilde{x})$:
$$\mathrm{Cov}_{q(x|\tilde{x})}[x] = \frac{\beta^2}{\alpha} \left( \frac{1}{\beta}\, \mathbb{E}_{q(x|\tilde{x})}[\epsilon \epsilon^\top] - \nabla_{\tilde{x}} \log q(\tilde{x}) \nabla_{\tilde{x}} \log q(\tilde{x})^\top \right), \tag{32}$$
where $\epsilon = (\tilde{x} - \sqrt{\alpha} x) / \sqrt{\beta}$. Now we can apply Equations (31) and (32) to establish the connection between OCM-DDPM and SN-DDPM, as demonstrated in the following theorem.
Theorem 3 (Connection between OCM-DDPM and SN-DDPM). Suppose $q(x_{0:T})$ is defined as in Equation (1), and the pre-trained score function is well-learned, i.e., $\nabla_{x_t} \log p_\theta(x_t) = \nabla_{x_t} \log q(x_t)$ for all $x_t$. Let $h_\phi$ and $g_\psi$ be parameterized neural networks. OCM-DDPM learns $h_\phi$ by minimizing the objective $L_{\mathrm{OCM}}(\phi)$ as in Equation (13), and SN-DDPM learns $g_\psi$ by minimizing the objective $L_{\mathrm{SN}}(\psi)$ as in Equation (22). Then in optimal training with $\phi^* = \arg\min L_{\mathrm{OCM}}(\phi)$ and $\psi^* = \arg\min L_{\mathrm{SN}}(\psi)$, the optimal diagonal covariance of OCM-DDPM,
$$\Sigma_{t-1}(x_t; \phi^*) = \frac{(1 - \alpha_t)^2 h_{\phi^*}(x_t) + (1 - \alpha_t) I}{\alpha_t},$$
and that of SN-DDPM,
$$\Sigma_{t-1}(x_t; \psi^*) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t + \frac{\beta_t^2}{\alpha_t} \left( \frac{g_{\psi^*}(x_t)}{1 - \bar{\alpha}_t} - \left(\nabla_{x_t} \log p_\theta(x_t)\right)^2 \right),$$
are identical.
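Proof. In optimal training, we know that
$$h_{\phi^*}(x_t) = \mathrm{diag}\left(\nabla_{x_t}^2 \log q(x_t)\right) \quad \text{and} \quad g_{\psi^*}(x_t) = \mathbb{E}_{q(x_0|x_t)}[\epsilon_t^2].$$
Bao et al. (2022b) show that the covariance of the denoising density $q(x_{t-1}|x_t)$ has a closed form (see Lemma 13 in Bao et al. (2022b)):
$$\mathrm{Cov}_{q(x_{t-1}|x_t)}[x_{t-1}] = \lambda_t^2 I + \gamma_t^2\, \mathrm{Cov}_{q(x_0|x_t)}[x_0], \tag{33}$$
where $\lambda_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$ and $\gamma_t = \sqrt{\bar{\alpha}_{t-1}} - \sqrt{1 - \bar{\alpha}_{t-1} - \lambda_t^2}\, \sqrt{\frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}} = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}$. Since $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)$, applying Lemma 3 gives
$$\mathrm{diag}\left(\mathrm{Cov}_{q(x_0|x_t)}[x_0]\right) = \frac{1 - \bar{\alpha}_t}{\bar{\alpha}_t} \left( I + (1 - \bar{\alpha}_t)\, h_{\phi^*}(x_t) \right).$$
Substituting it into Equation (33), we have
$$\mathrm{diag}\left(\mathrm{Cov}_{q(x_{t-1}|x_t)}[x_{t-1}]\right) = \frac{(1 - \alpha_t)^2 h_{\phi^*}(x_t) + (1 - \alpha_t) I}{\alpha_t} = \Sigma_{t-1}(x_t; \phi^*)$$
as desired. Alternatively, applying Equation (32), we have
$$\mathrm{diag}\left(\mathrm{Cov}_{q(x_0|x_t)}[x_0]\right) = \frac{(1 - \bar{\alpha}_t)^2}{\bar{\alpha}_t} \left( \frac{g_{\psi^*}(x_t)}{1 - \bar{\alpha}_t} - \left(\nabla_{x_t} \log p_\theta(x_t)\right)^2 \right).$$
Substituting it into Equation (33) gives
$$\mathrm{diag}\left(\mathrm{Cov}_{q(x_{t-1}|x_t)}[x_{t-1}]\right) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t + \frac{\beta_t^2}{\alpha_t} \left( \frac{g_{\psi^*}(x_t)}{1 - \bar{\alpha}_t} - \left(\nabla_{x_t} \log p_\theta(x_t)\right)^2 \right) = \Sigma_{t-1}(x_t; \psi^*).$$
Therefore, $\mathrm{diag}\left(\mathrm{Cov}_{q(x_{t-1}|x_t)}[x_{t-1}]\right) = \Sigma_{t-1}(x_t; \phi^*) = \Sigma_{t-1}(x_t; \psi^*)$ as desired.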
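The equality in Theorem 3 can be checked numerically when the optimal targets are available in closed form. The following sketch is an illustration under stated assumptions, not the paper's code: it takes one-dimensional data x_0 ~ N(0, data_var) so that the score, the Hessian h, and g = E[eps_t^2 | x_t] are analytic, and it uses beta_t = 1 - alpha_t and alpha-bar_t = alpha_t * alpha-bar_{t-1} as in the DDPM forward process. Function names and numbers are hypothetical.

```python
import math


def optimal_targets(data_var, abar_t, x_t):
    """Closed-form h*, g*, score, and Var[x_0 | x_t] for x_0 ~ N(0, data_var)."""
    marginal_var = abar_t * data_var + (1.0 - abar_t)        # variance of q(x_t)
    score = -x_t / marginal_var
    h = -1.0 / marginal_var                                  # second derivative of log q(x_t)
    post_var = 1.0 / (1.0 / data_var + abar_t / (1.0 - abar_t))
    post_mean = post_var * math.sqrt(abar_t) * x_t / (1.0 - abar_t)
    # g = E[eps^2 | x_t] with eps = (x_t - sqrt(abar_t) * x_0) / sqrt(1 - abar_t)
    g = ((x_t - math.sqrt(abar_t) * post_mean) ** 2 + abar_t * post_var) / (1.0 - abar_t)
    return h, g, score, post_var


def ocm_variance(h, alpha_t):
    return ((1.0 - alpha_t) ** 2 * h + (1.0 - alpha_t)) / alpha_t


def sn_variance(g, score, alpha_t, abar_t, abar_prev):
    beta_t = 1.0 - alpha_t
    return ((1.0 - abar_prev) / (1.0 - abar_t) * beta_t
            + beta_t**2 / alpha_t * (g / (1.0 - abar_t) - score**2))


def reference_variance(post_var, alpha_t, abar_t, abar_prev):
    """Equation (33): lambda_t^2 + gamma_t^2 * Var[x_0 | x_t]."""
    beta_t = 1.0 - alpha_t
    lam2 = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    gamma = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    return lam2 + gamma**2 * post_var


# Illustrative schedule values (assumptions).
alpha_t, abar_prev = 0.95, 0.9
abar_t = alpha_t * abar_prev
h, g, score, post_var = optimal_targets(data_var=0.5, abar_t=abar_t, x_t=0.3)

var_ocm = ocm_variance(h, alpha_t)
var_sn = sn_variance(g, score, alpha_t, abar_t, abar_prev)
var_ref = reference_variance(post_var, alpha_t, abar_t, abar_prev)
print(var_ocm, var_sn, var_ref)
```

The SN-DDPM expression subtracts a squared score from a second-moment term, whereas the OCM-DDPM expression is linear in h, which is consistent with the error-amplification point discussed in the remark that follows.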
Remark. Theorem 3 establishes the connection between OCM-DDPM and SN-DDPM using the second-order Tweedie's formula. This connection allows OCM-DDPM to utilize the same covariance clipping trick as in SN-DDPM (see Appendix B.2 for details). Additionally, as highlighted in Bao et al. (2022a), SN-DDPM suffers from error amplification in the quadratic term, a limitation not present in OCM-DDPM. As shown in Figure 4 and Table 3, OCM-DDPM demonstrates superior covariance estimation accuracy and likelihood performance compared to SN-DDPM, underscoring the advantages of the proposed optimal covariance matching objective.

Section: B DETAILS OF EXPERIMENTS

Section: B.1 DETAILS OF MODEL ARCHITECTURES
In this section, we describe our model architectures in detail. Following the approach in Bao et al. (2022a), our model comprises a pretrained score neural network with fixed parameters, along with a trainable diagonal Hessian prediction network built on top of it.
Details of Pretrained Score Prediction Networks. Table 6 lists the pretrained neural networks utilized in our experiments. These models parameterize the noise prediction $\epsilon_\theta(x_t)$, allowing us to derive the score prediction as
$$s_\theta(x_t) = \nabla_{x_t} \log p_\theta(x_t) = -\frac{\epsilon_\theta(x_t)}{\sqrt{1 - \bar{\alpha}_t}},$$
following the forward process defined in Equation (1). It is important to note that the pretrained networks for ImageNet 64x64 and 256x256 include both noise prediction networks and covariance networks. In our model architectures, we utilize only the noise prediction networks.
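The conversion from a noise prediction to a score can be sketched as follows. This is a minimal illustration that assumes the forward process x_t = sqrt(alpha-bar_t) x_0 + sqrt(1 - alpha-bar_t) eps; the function name, the plain-list vector representation, and the numbers are assumptions made for the example. With a single data point x_0 and the exact noise, the result coincides with the score of q(x_t | x_0).

```python
import math


def score_from_noise_prediction(eps_pred, abar_t):
    """Convert a noise prediction into a score: s = -eps / sqrt(1 - abar_t)."""
    scale = math.sqrt(1.0 - abar_t)
    return [-e / scale for e in eps_pred]


# Hypothetical single data point, noise draw, and noise level.
x0 = [0.5, -1.0]
eps = [0.3, -0.2]
abar_t = 0.64
x_t = [math.sqrt(abar_t) * a + math.sqrt(1.0 - abar_t) * e for a, e in zip(x0, eps)]

score = score_from_noise_prediction(eps, abar_t)
# Score of the Gaussian q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) I).
conditional_score = [-(xt - math.sqrt(abar_t) * a) / (1.0 - abar_t) for xt, a in zip(x_t, x0)]
print(score, conditional_score)
```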

Details of Diagonal Hessian Prediction Networks. For fair comparisons, we follow the parameterization of Bao et al. (2022a) for all models excluding the one on ImageNet 256x256, which was not explored in their paper. The architecture details of NN 1 and NN 2 are provided in Table 7, where Conv denotes the convolutional layer and Res denotes the residual block (He et al., 2016).

Table 8: The number of parameters and the averaged time (ms) to run a model function evaluation. All are evaluated with a batch size of 64 on one A100-80GB GPU.

Details of Table 4. In Table 4, the FID results of baselines are taken from Bao et al. (2022a). Notably, the time cost for a single neural function evaluation is identical across DDPM, DDIM, and Analytic-DPM, all of which use a fixed isotropic covariance in the backward Markov process.
The ratio of the time cost to the baselines is based on Table 8. For the FID results of our model, OCM-DPM, we report performance using the DDPM forward process for LSUN Bedroom and the DDIM forward process for the other datasets.

Section: B.2 DETAILS OF TRAINING, INFERENCE, AND EVALUATION
Our training and inference recipes largely follow those outlined in Bao et al. (2022a); Peebles & Xie (2023). Below, we provide a detailed description.
Training Details. We use the AdamW optimizer (Loshchilov, 2017) with a learning rate of 0.0001 and train for 500K iterations across all datasets. The batch sizes are set to 64 for LSUN Bedroom, 128 for CIFAR10, CelebA 64x64, and ImageNet 64x64, and 256 for ImageNet 256x256. During training, checkpoints are saved every 10K iterations, and we select the checkpoint with the best FID on 2048 samples generated with full sampling steps. We train our models using one A100-80GB GPU for CIFAR10, CelebA 64x64, and ImageNet 64x64; four A100-80GB GPUs for LSUN Bedroom; and eight A100-80GB GPUs for ImageNet 256x256.
Sampling Details. As per Bao et al. (2022b), covariance clipping is crucial to the performance of diffusion models with a non-fixed variance in the backward Markov process. Leveraging the connection between OCM-DPM and SN-DPM established in Theorem 3, we can apply the same clipping strategies outlined in Bao et al. (2022a;b). Specifically, we only display the mean of $p(x_0|x_1)$ at the last sampling step and clip the covariance $\Sigma_1(x_2)$ of $p(x_1|x_2)$ such that $\|\Sigma_1(x_2)\|_\infty\, \mathbb{E}|\epsilon| \le \frac{2}{255} y$, where $\|\cdot\|_\infty$ denotes the infinity norm and $\epsilon$ is standard Gaussian noise. Following Bao et al. (2022b), we use y = 2 on CIFAR10 (LS) and CelebA 64x64 under the DDPM forward process, and y = 1 in all other cases. For all skip-step sampling methods, we employ the even trajectory (Nichol & Dhariwal, 2021) for selecting the subset of sampling steps.
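The clipping rule can be sketched as below. This is a hedged illustration that takes the bound literally as written above and enforces it by clamping each diagonal entry; whether the actual implementation clips elementwise, and whether it acts on the variance or the standard deviation, is not specified here and is an assumption of the example. The only fixed fact used is that E|eps| = sqrt(2/pi) for a standard Gaussian. Function names and values are hypothetical.

```python
import math


def clip_threshold(y):
    """Largest value allowed for the infinity norm under the bound (2 / 255) * y."""
    expected_abs_eps = math.sqrt(2.0 / math.pi)   # E|eps| for eps ~ N(0, 1)
    return (2.0 / 255.0) * y / expected_abs_eps


def clip_diagonal(diag_entries, y):
    """Clamp non-negative diagonal entries so that their maximum satisfies the bound."""
    bound = clip_threshold(y)
    return [min(entry, bound) for entry in diag_entries]


# Hypothetical diagonal entries of Sigma_1(x_2) for three pixels.
raw = [0.001, 0.02, 0.005]
clipped = clip_diagonal(raw, y=1)
print(clip_threshold(1), clipped)
```

With y = 1 the threshold is slightly below 0.01, so only the second entry in this example is changed.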
Evaluation Details. The performance is evaluated on the exponential moving average (EMA) model with a rate of 0.9999. For computing the negative log-likelihood, we follow Ho et al. (2020); Bao et al. (2022a) by discretizing the last sampling step $p(x_0|x_1)$ to obtain the likelihood of discrete image data and report the upper bound of the negative log-likelihood on the entire test set. The FID score is computed on 50K generated samples. Following Nichol & Dhariwal (2021); Bao et al.

Table 9: The corresponding mean and deviation of NLL and FID reported in Tables 3 and 5. Note that the standard deviation is reported under the scale of percentage (%).

Section: C.1 ADDITIONAL TOY DEMONSTRATION
To further verify the effectiveness of our method, we include an additional toy example. In this case, we consider another two-dimensional mixture of nine Gaussians (MoG) with means located at {-3, 0, 3} ⊗ {-3, 0, 3} and a standard deviation of σ = 0.1. To assess different approaches, we first learn the covariance using the true scores. Specifically, we train the covariance networks for these methods over 50,000 iterations with a learning rate of 0.001 using an Adam optimizer (Kingma & Ba, 2014). Figure 4a shows the $L_2$ error of the estimated diagonal covariance $\Sigma_{t-1}(x_t)$, and our method, OCM-DDPM, consistently achieves the lowest error compared to the other methods. In practice, the true score is not accessible. Therefore, we also conduct a comparison using a score function learned with DSM. We then apply different covariance estimation methods under the same settings. In Figure 4b, we plot the same $L_2$ error for the learned score setting and find that our method achieves the lowest error for all t values.
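For this mixture the true score and the true diagonal Hessian of the noised marginal are available in closed form, which is what makes the true-score comparison possible. The sketch below is one way to compute them, assuming equal component weights and the forward process x_t = sqrt(alpha-bar_t) x_0 + sqrt(1 - alpha-bar_t) eps; the equal weights, the function names, and the evaluation point are assumptions of this example rather than details stated in the paper.

```python
import math
from itertools import product

MEANS = list(product((-3.0, 0.0, 3.0), repeat=2))   # nine component means
SIGMA = 0.1                                         # component standard deviation


def noised_components(abar_t):
    """Means and shared isotropic variance of the noised mixture q(x_t)."""
    scale = math.sqrt(abar_t)
    var = abar_t * SIGMA**2 + (1.0 - abar_t)
    return [(scale * mx, scale * my) for mx, my in MEANS], var


def component_log_kernels(x, abar_t):
    means, var = noised_components(abar_t)
    return [-((x[0] - mx) ** 2 + (x[1] - my) ** 2) / (2.0 * var) for mx, my in means]


def log_density(x, abar_t):
    _, var = noised_components(abar_t)
    logs = component_log_kernels(x, abar_t)
    top = max(logs)                                  # log-sum-exp for stability
    return top + math.log(sum(math.exp(l - top) for l in logs)) - math.log(9.0 * 2.0 * math.pi * var)


def score_and_diag_hessian(x, abar_t):
    """Closed-form gradient and diagonal Hessian of log q(x_t)."""
    means, var = noised_components(abar_t)
    logs = component_log_kernels(x, abar_t)
    top = max(logs)
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    weights = [w / total for w in weights]           # posterior component responsibilities
    score, diag = [], []
    for dim in range(2):
        d = [(m[dim] - x[dim]) / var for m in means]  # per-component score
        first = sum(w * di for w, di in zip(weights, d))
        second = sum(w * di * di for w, di in zip(weights, d))
        score.append(first)
        diag.append(-1.0 / var + second - first**2)
    return score, diag


x_eval, abar_eval = (0.4, -1.1), 0.5
score, diag_hessian = score_and_diag_hessian(x_eval, abar_eval)
print(score, diag_hessian)
```

The diagonal Hessian obtained this way can be substituted for h in the OCM-DDPM covariance expression of Theorem 3 to obtain a reference diagonal covariance against which estimated covariances are compared.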
We further use the learned score and the covariance estimated by different methods to conduct DDPM and DDIM sampling for this MoG problem. Figures 4c and 4d present the MMD comparison across various methods. They show that our method outperforms the baselines when the total number of time steps is small, demonstrating the importance of accurate covariance estimation in the context of diffusion acceleration. We also include two methods utilizing the true diagonal and full covariance as benchmarks, representing the best achievable performance. Notably, these two methods exhibit similar performance because the MoG used in this setting has symmetric components, in which the covariance is dominated by the diagonal entries.

Section: C.2 ADDITIONAL LIKELIHOOD COMPARISON
Here, we provide additional likelihood comparisons against I-DDPM (Nichol & Dhariwal, 2021) on the ImageNet 64x64 dataset. As discussed in Section 4, I-DDPM parameterizes the diagonal covariance by interpolating between β and β̃ and learns the covariance by maximizing the variational lower bound. For ease of comparison, we also include the performance of NPR-DDPM and SN-DDPM, which employ an alternative MSE loss as detailed in Equation (22) to learn the covariance. Table 10 presents the results, showing that OCM-DDPM achieves the best likelihood with fewer sampling steps while maintaining performance comparable to I-DDPM when using full sampling steps. This highlights the advantage of the proposed optimal covariance matching objective in learning a diffusion model for data density estimation.

In Tables 3 and 5, we evaluate our models using the same random seed as Bao et al. (2022a). To minimize the impact of randomness, we report the mean and standard deviation in Table 9.

We report the performance using 10 sampling steps with varying CFG coefficients in Table 12.


Section: C.3 IMPACT OF THE NUMBERS OF RADEMACHER SAMPLES
The results indicate that our methods perform best at CFG = 2.0. While DDPM-β and DDIM show strong FID scores, their Recall is lower due to fixed variance, suggesting less diversity in the generated samples. In Table 13, we further compare the performance across different sampling steps at CFG = 1.5. The results again demonstrate that our methods, which estimate the optimal covariance from the data, produce more diverse samples while maintaining comparable image quality.

Section: C.6 GENERATED SAMPLES
In this section, we conduct qualitative studies by showcasing the generated samples from our models using different sampling steps K. The results are summarized as follows:
• In Figure 6, we visualize the generated samples using varying numbers of Monte Carlo samples with the Rademacher estimator (refer to Table 1).
• In Figure 7, we visualize the training data and the generated samples using the minimum number of sampling steps required to achieve an FID of approximately 6 (refer to Table 4).
• In Figures 8 to 11, we visualize the generated samples of our models using different numbers of sampling steps on CIFAR10 (LS), CIFAR10 (CS), CelebA 64x64, and ImageNet 64x64, respectively (refer to Table 5).
• In Figure 12, we visualize the generated samples on ImageNet 256x256, using 10 sampling steps with different CFG coefficients (refer to Figure 3 and Table 12).
• In Figure 13, we visualize the generated samples on ImageNet 256x256, using varying numbers of sampling steps with CFG set to 1.5 (refer to Table 13).

Section: ACKNOWLEDGMENTS
ZO is supported by the Lee Family Scholarship. MZ and DB acknowledge funding from AI Hub in Generative Models, under grant EP/Y028805/1 and funding from the Cisco Centre of Excellence. We want to thank Wenlin Chen for the useful discussions.

Section: A ABSTRACT PROOF AND DERIVATIONS
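
Section: A.1 DERIVATIONS OF THE ANALYTICAL FULL COVARIANCE IDENTITY
Theorem 1 (Generalized Analytical Covariance Identity). Given a joint distribution $q(\tilde{x}, x) = q(\tilde{x}|x)\, q(x)$ with $q(\tilde{x}|x) = \mathcal{N}(\alpha x, \sigma^2 I)$, the covariance of the true posterior $q(x|\tilde{x}) \propto q(x)\, q(\tilde{x}|x)$, which is defined as
$$\Sigma(\tilde{x}) = \mathbb{E}_{q(x|\tilde{x})}[x x^\top] - \mathbb{E}_{q(x|\tilde{x})}[x]\, \mathbb{E}_{q(x|\tilde{x})}[x]^\top,$$
has a closed form:
$$\Sigma(\tilde{x}) = \left(\sigma^4 \nabla_{\tilde{x}}^2 \log q(\tilde{x}) + \sigma^2 I\right) / \alpha^2.$$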


References:
[b0] T Akhound-Sadegh; J Rector-Brooks; A J Bose; S Mittal; P Lemos; C.-H Liu; M Sendera; S Ravanbakhsh; G Gidel; Y Bengio (2024). Iterated denoising energy matching for sampling from boltzmann densities. 
[b1] F Bao; C Li; J Sun; J Zhu; B Zhang (2022). Estimating the optimal covariance with imperfect mean in diffusion probabilistic models. 
[b2] F Bao; C Li; J Zhu; B Zhang (2022). Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. 
[b3] C Bekas; E Kokiopoulou; Y Saad (2007). An estimator for the diagonal of a matrix. Applied numerical mathematics
[b4] D Berthelot; A Autef; J Lin; D A Yap; S Zhai; S Hu; D Zheng; W Talbott; E Gu (2023). Tract: Denoising diffusion models with transitive closure time-distillation. 
[b5] A Blattmann; T Dockhorn; S Kulal; D Mendelevitch; M Kilian; D Lorenz; Y Levi; Z English; V Voleti; A Letts (2023). Stable video diffusion: Scaling latent video diffusion models to large datasets. 
[b6] V De Bortoli; A Galashov; J S Guntupalli; G Zhou; K Murphy; A Gretton; A Doucet (2025). Distributional diffusion models with scoring rules. 
[b7] H Chen; M Xia; Y He; Y Zhang; X Cun; S Yang; J Xing; Y Liu; Q Chen; X Wang (2023). Videocrafter1: Open diffusion models for high-quality video generation. 
[b8] W Chen; M Zhang; B Paige; J M Hernández-Lobato; D Barber (2024). Diffusive gibbs sampling. 
[b9] H Chung; J Kim; M T Mccann; M L Klasky; J C Ye (2022). Diffusion posterior sampling for general noisy inverse problems. 
[b10] P Dayan; G E Hinton; R M Neal; R S Zemel (1995). The helmholtz machine. Neural computation
[b11] V De Bortoli; J Thornton; J Heng; A Doucet (2021). Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems
[b12] B Efron (2011). Tweedie's formula and selection bias. Journal of the American Statistical Association
[b13] A Gretton; K M Borgwardt; M J Rasch; B Schölkopf; A Smola (2012). A kernel two-sample test. The Journal of Machine Learning Research
[b14] K He; X Zhang; S Ren; J Sun (2016). Deep residual learning for image recognition. 
[b15] J Heek; E Hoogeboom; T Salimans (2024). Multistep consistency models. 
[b16] M Heusel; H Ramsauer; T Unterthiner; B Nessler; S Hochreiter (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems
[b17] J Ho; A Jain; P Abbeel (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems
[b18] J Ho; T Salimans; A Gritsenko; W Chan; M Norouzi; D J Fleet (2022). Video diffusion models. Advances in Neural Information Processing Systems
[b19] E Hoogeboom; V G Satorras; C Vignac; M Welling (2022). Equivariant diffusion for molecule generation in 3d. PMLR
[b20] M F Hutchinson (1990). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation
[b21] A Jolicoeur-Martineau; K Li; R Piché-Taillefer; T Kachman; I Mitliagkas (2021). Gotta go fast when generating data with score-based models. 
[b22] T Karras; M Aittala; T Aila; S Laine (2022). Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems
[b23] D Kim; C.-H Lai; W.-H Liao; N Murata; Y Takida; T Uesaka; Y He; Y Mitsufuji; S Ermon (2023). Consistency trajectory models: Learning probability flow ode trajectory of diffusion. 
[b24] D Kingma; T Salimans; B Poole; J Ho (2021). Variational diffusion models. Advances in neural information processing systems
[b25] D P Kingma; J Ba (2014). Adam: A method for stochastic optimization. 
[b26] D P Kingma; M Welling (2013). Auto-encoding variational bayes. 
[b27] A Krizhevsky; G Hinton (2009). Learning multiple layers of features from tiny images. 
[b28] L Li; J He (2024). Bidirectional consistency models. 
[b29] X Li; J Thickstun; I Gulrajani; P S Liang; T B Hashimoto (2022). Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems
[b30] H Liu; Z Chen; Y Yuan; X Mei; X Liu; D Mandic; W Wang; M D Plumbley (2023). AudioLDM: Text-to-audio generation with latent diffusion models. 
[b31] L Liu; Y Ren; Z Lin; Z Zhao (2022). Pseudo numerical methods for diffusion models on manifolds. 
[b32] Z Liu; P Luo; X Wang; X Tang (2015). Deep learning face attributes in the wild. 
[b33] I Loshchilov (2017). Decoupled weight decay regularization. 
[b34] C Lu; Y Zhou; F Bao; J Chen; C Li; J Zhu (2022). Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems
[b35] W Luo; T Hu; S Zhang; J Sun; Z Li; Z Zhang (2024). Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems
[b36] J Martens; I Sutskever; K Swersky (2012). Estimating the hessian by back-propagating curvature. 
[b37] C Meng; Y Song; W Li; S Ermon (2021). Estimating high order gradients of the data distribution by denoising. Advances in Neural Information Processing Systems
[b38] A Q Nichol; P Dhariwal (2021). Improved denoising diffusion probabilistic models. PMLR
[b39] W Peebles; S Xie (2023). Scalable diffusion models with transformers. 
[b40] B Poole; A Jain; J T Barron; B Mildenhall (2022). DreamFusion: Text-to-3D using 2D diffusion. 
[b41] H E Robbins (1992). An empirical bayes approach to statistics. Springer
[b42] R Rombach; A Blattmann; D Lorenz; P Esser; B Ommer (2022). High-resolution image synthesis with latent diffusion models. 
[b43] F Rozet; G Andry; F Lanusse; G Louppe (2024). Learning diffusion priors from observations by expectation maximization. 
[b44] M S Sajjadi; O Bachem; M Lucic; O Bousquet; S Gelly (2018). Assessing generative models via precision and recall. Advances in neural information processing systems
[b45] T Salimans; J Ho (2022). Progressive distillation for fast sampling of diffusion models. 
[b46] T Salimans; T Mensink; J Heek; E Hoogeboom (2024). Multistep distillation of diffusion models via moment matching. 
[b47] J Sohl-Dickstein; E Weiss; N Maheswaranathan; S Ganguli (2015). Deep unsupervised learning using nonequilibrium thermodynamics. PMLR
[b48] J Song; C Meng; S Ermon (2020). Denoising diffusion implicit models. 
[b49] Y Song; S Ermon (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems
[b50] Y Song; J Sohl-Dickstein; D P Kingma; A Kumar; S Ermon; B Poole (2021). Score-based generative modeling through stochastic differential equations. 
[b51] Y Song; P Dhariwal; M Chen; I Sutskever (2023). Consistency models. 
[b52] J Townsend; T Bird; D Barber (2019). Practical lossless compression with latent variables using bits back coding. 
[b53] A Vahdat; K Kreis; J Kautz (2021). Score-based generative modeling in latent space. Advances in neural information processing systems
[b54] F Vargas; W Grathwohl; A Doucet (2023). Denoising diffusion samplers. 
[b55] P Vincent (2011). A connection between score matching and denoising autoencoders. Neural computation
[b56] G Wang; Y Jiao; Q Xu; Y Wang; C Yang (2021). Deep generative learning via schrödinger bridge. PMLR
[b57] Z Wang; C Lu; Y Wang; F Bao; C Li; H Su; J Zhu (2024). Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in Neural Information Processing Systems
[b58] Z Xiao; K Kreis; A Vahdat (2021). Tackling the generative learning trilemma with denoising diffusion gans. 
[b59] S Xie; Z Xiao; D P Kingma; T Hou; Y N Wu; K P Murphy; T Salimans; B Poole; R Gao (2024). Em distillation for one-step diffusion models. 
[b60] M Xu; T Geffner; K Kreis; W Nie; Y Xu; J Leskovec; S Ermon; A Vahdat (2024). Energy-based diffusion language models for text generation. 
[b61] F Yu; A Seff; Y Zhang; S Song; T Funkhouser; J Xiao (2015). LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. 
[b62] L Yu; T Xie; Y Zhu; T Yang; X Zhang; C Zhang (2024). Hierarchical semi-implicit variational inference with application to diffusion model acceleration. Advances in Neural Information Processing Systems
[b63] F Zhang; J He; L I Midgley; J Antorán; J M Hernández-Lobato (2024). Efficient and unbiased sampling of boltzmann distributions via consistency models. 
[b64] M Zhang; T Bird; R Habib; T Xu; D Barber (2019). Variational f-divergence minimization. 
[b65] M Zhang; P Hayes; T Bird; R Habib; D Barber (2020). Spread divergence. PMLR
[b66] M Zhang; A Zhang; S Mcdonagh (2021). On the out-of-distribution generalization of probabilistic image modelling. Advances in Neural Information Processing Systems
[b67] M Zhang; A Hawkins-Hooker; B Paige; D Barber (2024). Moment matching denoising gibbs sampling. Advances in Neural Information Processing Systems
[b68] M Zhang; J He; W C Chen; Z Ou; J M Hernández-Lobato; B Schölkopf; D Barber (2025). Towards training one-step diffusion models without distillation. 
[b69] M Zhao; H Zhu; C Xiang; K Zheng; C Li; J Zhu (2024). Identifying and solving conditional image leakage in image-to-video diffusion model. 
[b70] M Zhou; H Zheng; Z Wang; M Yin; H Huang (2024). Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. 

Figures:
Figure fig_0: 2
Type: figure
Caption: Figure 2: The results of FID vs. NLL for different methods with varying numbers of sampling steps on CIFAR10 (CS). Our method consistently achieves the best trade-off between FID and NLL.
Data: 

Figure fig_1: 
Type: figure
Caption: (a) Training Data and Density (b) DDPM MMD v.s. Steps (c) DDIM MMD v.s. Steps
Data: 

Figure fig_2: 1
Type: figure
Caption: Figure 1: Comparisons of different covariance estimation methods. Figure (a) shows the training data and the ground truth density. Figures (b) and (c) present the MMD evaluation against the total sampling steps in the DDPM (b) and DDIM (c) settings.
Data: 

Figure fig_3: 
Type: figure
Caption: The results, shown in Figures 1b and 1c, demonstrate that the proposed method outperforms other covariance learning approaches. Additionally, we include two methods utilizing the true diagonal and full covariance, serving as benchmarks for the best achievable performance. Notably, in this case, using the full covariance yields better performance compared to the diagonal covariance. This observation highlights the importance of accurate covariance approximation, suggesting that improved methods for learning the off-diagonal terms could lead to even better results. For further analysis, we present an additional toy experiment in Appendix C.1 to demonstrate the efficacy of our approach.
Data: 

Figure fig_4: 3
Type: figure
Caption: Figure 3: Results of DiT training on ImageNet 256x256. We generate samples using 10 timesteps with varying CFG coefficients (see Table 12 for exact numerical values).
Data: 

Figure fig_5: 4
Type: figure
Caption: Figure 4: Comparisons of different covariance estimation methods based on estimation error and sample generation quality. Figures (a) and (b) show the mean square error of the estimated diagonal covariance under two assumptions: (a) access to the true score, and (b) a learned score of the data distribution at various noise levels. Figures (c) and (d) present the MMD evaluation against the total sampling steps in the DDPM (c) and DDIM (d) settings. The proposed OCM method achieves the lowest estimation error and consistently outperforms the other baseline methods when fewer generation steps are applied.
Data: 

Figure fig_6: 56
Type: figure
Caption: Figure 5: Diagonal covariance estimation visualisation with different Rademacher sample numbers.
Data: 

Figure fig_7: 8
Type: figure
Caption: Figure 7: The training data and generated samples of OCM-DPM with minimum steps to achieve an FID around 6 (ref: Table 4).
Data: 

Figure fig_8: 
Type: figure
Caption: Figure 9: Generated samples with different sampling steps on CIFAR10 (CS) (ref: Table 5).
Data: 

Figure fig_9: 
Type: figure
Caption: Figure 10: Generated samples with different sampling steps on CelebA 64x64 (ref: Table 5).
Data: 

Figure fig_10: 
Type: figure
Caption: Figure 11: Generated samples with different sampling steps on ImageNet 64x64 (ref: Table 5).
Data: 

Figure tab_0: 1
Type: table
Caption: 
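Data:
METHOD | FID ↓ (5 steps) | FID ↓ (10) | FID ↓ (15) | FID ↓ (20) | NLL ↓ (5 steps) | NLL ↓ (10) | NLL ↓ (15) | NLL ↓ (20)
DDPM, β | 58.28 | 34.76 | 24.02 | 19.00 | 203.29 | 74.95 | 44.94 | 32.20
DDPM, β | 254.07 | 205.31 | 149.67 | 109.81 | 7.33 | 6.51 | 6.06 | 5.77
OCM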

Figure tab_2: 3
Type: table
Caption: The NLL (bits/dim) ↓ across various datasets using different sampling steps.
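Data:
CIFAR10 (LS) and CIFAR10 (CS)
# TIMESTEPS K | LS 10 | LS 25 | LS 50 | LS 100 | LS 200 | LS 1000 | CS 10 | CS 25 | CS 50 | CS 100 | CS 200 | CS 1000
DDPM, β | 74.95 | 24.98 | 12.01 | 7.08 | 5.03 | 3.73 | 75.96 | 24.94 | 11.96 | 7.04 | 4.95 | 3.60
DDPM, β | 6.99 | 6.11 | 5.44 | 4.86 | 4.39 | 3.75 | 6.51 | 5.55 | 4.92 | 4.41 | 4.03 | 3.54
A-DDPM | 5.47 | 4.79 | 4.38 | 4.07 | 3.84 | 3.59 | 5.08 | 4.45 | 4.09 | 3.83 | 3.64 | 3.42
NPR-DDPM | 5.40 | 4.64 | 4.25 | 3.98 | 3.79 | 3.57 | 5.03 | 4.33 | 3.99 | 3.76 | 3.59 | 3.41
SN-DDPM | 30.79 | 11.83 | 7.13 | 5.24 | 4.39 | 3.74 | 90.85 | 19.81 | 9.72 | 6.72 | 5.58 | 4.73
OCM-DDPM | 5.32 | 4.63 | 4.25 | 3.97 | 3.78 | 3.57 | 4.99 | 4.34 | 3.99 | 3.76 | 3.59 | 3.41
CELEBA 64X64 and IMAGENET 64X64
# TIMESTEPS K | CelebA 10 | CelebA 25 | CelebA 50 | CelebA 100 | CelebA 200 | CelebA 1000 | ImageNet 25 | ImageNet 50 | ImageNet 100 | ImageNet 200 | ImageNet 400 | ImageNet 4000
DDPM, β | 33.42 | 13.09 | 7.14 | 4.60 | 3.45 | 2.71 | 105.87 | 46.25 | 22.02 | 12.10 | 7.59 | 3.89
DDPM, β | 6.67 | 5.72 | 4.98 | 4.31 | 3.74 | 2.93 | 5.81 | 5.20 | 4.70 | 4.31 | 4.04 | 3.65
A-DDPM | 4.54 | 3.89 | 3.48 | 3.16 | 2.92 | 2.66 | 4.78 | 4.42 | 4.15 | 3.95 | 3.81 | 3.61
NPR-DDPM | 4.46 | 3.78 | 3.40 | 3.11 | 2.89 | 2.65 | 4.66 | 4.22 | 3.96 | 3.80 | 3.71 | 3.60
SN-DDPM | 18.09 | 8.05 | 5.29 | 4.05 | 3.40 | 2.84 | 4.56 | 4.18 | 3.95 | 3.80 | 3.71 | 3.63
OCM-DDPM | 4.69 | 3.86 | 3.43 | 3.13 | 2.90 | 2.66 | 4.45 | 4.15 | 3.93 | 3.79 | 3.70 | 3.59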

Figure tab_3: 4
Type: table
Caption: The least number of timesteps ↓ required to achieve an FID around 6 (along with the corresponding FID). To account for the additional time cost incurred by the covariance prediction network, we multiply the results by the ratio of the time cost per single timestep, reflecting the extra computational overhead (see Appendix B.1 for details).
lower than those of the proposed OCM methods. This discrepancy arises partly because DDPM and DDIM model the inverse process with deterministic variance, whereas OCM methods explicitly learn the variance from data, allowing for more accurate density estimation and generating more diverse results. As Peebles & Xie (2023) report that DiT achieves the best FID performance with CFG = 1.5, we further evaluate performance with different sampling steps under CFG = 1.5, as shown in Table 13. The results indicate that our OCM-DPM methods offer a great improvement in sample quality and diversity.
Data:
METHOD | CIFAR10 | CELEBA 64X64 | LSUN BEDROOM | IMAGENET 256X256
DDPM | 90 (6.12) | > 200 | 130 (6.06) | 21 (5.89)
DDIM | 30 (5.85) | > 100 | BEST FID > 6 | 11 (5.58)
IMPROVED DDPM | 45 (5.96) | MISSING MODEL | 90 (6.02) | 22 (6.08)
ANALYTIC-DPM | 25 (5.81) | 55 (5.98) | 100 (6.05) | MISSING MODEL
NPR-DPM | 1.002×23 (5.76) | 1.013×50 (6.04) | 1.021×90 (6.01) | MISSING MODEL
SN-DPM | 1.005×17 (5.81) | 1.019×22 (5.96) | 1.114×92 (6.02) | MISSING MODEL
OCM-DPM (OURS) | 1.003×16 (5.83) | 1.015×21 (5.94) | 1.112×90 (6.04) | 1.007×10 (5.33)

Figure tab_4: 5
Type: table
Caption: FID score ↓ across various datasets using different sampling steps.
Data:
CIFAR10 (LS) and CIFAR10 (CS)
# TIMESTEPS K | LS 10 | LS 25 | LS 50 | LS 100 | LS 200 | LS 1000 | CS 10 | CS 25 | CS 50 | CS 100 | CS 200 | CS 1000
DDPM, β | 44.45 | 21.83 | 15.21 | 10.94 | 8.23 | 5.11 | 34.76 | 16.18 | 11.11 | 8.38 | 6.66 | 4.92
DDPM, β | 233.41 | 125.05 | 66.28 | 31.36 | 12.96 | 3.04 | 205.31 | 84.71 | 37.35 | 14.81 | 5.74 | 3.34
A-DDPM | 34.26 | 11.60 | 7.25 | 5.40 | 4.01 | 4.03 | 22.94 | 8.50 | 5.50 | 4.45 | 4.04 | 4.31
NPR-DDPM | 32.35 | 10.55 | 6.18 | 4.52 | 3.57 | 4.10 | 19.94 | 7.99 | 5.31 | 4.52 | 4.10 | 4.27
SN-DDPM | 24.06 | 6.91 | 4.63 | 3.67 | 3.31 | 3.65 | 16.33 | 6.05 | 4.17 | 3.83 | 3.72 | 4.07
OCM-DDPM | 24.94 | 9.19 | 5.95 | 4.36 | 3.48 | 3.98 | 14.32 | 5.54 | 4.10 | 3.84 | 3.75 | 4.18
DDIM | 21.31 | 10.70 | 7.74 | 6.08 | 5.07 | 4.13 | 34.34 | 16.68 | 10.48 | 7.94 | 6.69 | 4.89
A-DDIM | 14.00 | 5.81 | 4.04 | 3.55 | 3.39 | 3.74 | 26.43 | 9.96 | 6.02 | 4.88 | 4.92 | 4.66
NPR-DDIM | 13.34 | 5.38 | 3.95 | 3.53 | 3.42 | 3.72 | 22.81 | 9.47 | 6.04 | 5.02 | 5.06 | 4.62
SN-DDIM | 12.19 | 4.28 | 3.39 | 3.23 | 3.22 | 3.65 | 17.90 | 7.36 | 5.16 | 4.63 | 4.63 | 4.51
OCM-DDIM | 10.66 | 4.35 | 3.48 | 3.27 | 3.29 | 3.74 | 16.70 | 6.71 | 4.72 | 4.30 | 4.54 | 4.53
CELEBA 64X64 and IMAGENET 64X64
# TIMESTEPS K | CelebA 10 | CelebA 25 | CelebA 50 | CelebA 100 | CelebA 200 | CelebA 1000 | ImageNet 25 | ImageNet 50 | ImageNet 100 | ImageNet 200 | ImageNet 400 | ImageNet 4000
DDPM, β | 36.69 | 24.46 | 18.96 | 14.31 | 10.48 | 5.95 | 29.21 | 21.71 | 19.12 | 17.81 | 17.48 | 16.55
DDPM, β | 294.79 | 115.69 | 53.39 | 25.65 | 9.72 | 3.16 | 170.28 | 83.86 | 45.04 | 28.39 | 21.38 | 16.38
A-DDPM | 28.99 | 16.01 | 11.23 | 8.08 | 6.51 | 5.21 | 32.56 | 22.45 | 18.80 | 17.16 | 16.40 | 16.34
NPR-DDPM | 28.37 | 15.74 | 10.89 | 8.23 | 7.03 | 5.33 | 28.27 | 20.89 | 18.06 | 16.96 | 16.32 | 16.38
SN-DDPM | 20.60 | 12.00 | 7.88 | 5.89 | 5.02 | 4.42 | 27.58 | 20.74 | 18.04 | 16.61 | 16.37 | 16.22
OCM-DDPM | 21.55 | 12.71 | 9.24 | 6.97 | 5.92 | 5.04 | 28.02 | 20.81 | 17.98 | 16.74 | 16.32 | 16.31
DDIM | 20.54 | 13.45 | 9.33 | 6.60 | 4.96 | 3.40 | 26.06 | 20.10 | 18.09 | 17.84 | 17.74 | 19.00
A-DDIM | 15.62 | 9.22 | 6.13 | 4.29 | 3.46 | 3.13 | 25.98 | 19.23 | 17.73 | 17.49 | 17.44 | 18.98
NPR-DDIM | 14.98 | 8.93 | 6.04 | 4.27 | 3.59 | 3.15 | 28.84 | 19.62 | 17.63 | 17.42 | 17.30 | 18.91
SN-DDIM | 10.20 | 5.48 | 3.83 | 3.04 | 2.85 | 2.90 | 28.07 | 19.38 | 17.53 | 17.23 | 17.23 | 18.89
OCM-DDIM | 10.28 | 5.72 | 4.42 | 3.54 | 3.17 | 3.03 | 28.28 | 19.62 | 17.71 | 17.42 | 17.26 | 19.02

Figure tab_5: 6
Type: table
Caption: Source of pretrained score prediction networks used in our experiments.
Data:
Dataset | Provided by
CIFAR10 (LS) | Bao et al. (2022b)
CIFAR10 (CS) | Bao et al. (2022b)
CelebA 64x64 | Song et al. (2020)
ImageNet 64x64 | Nichol & Dhariwal (2021)
LSUN Bedroom | Ho et al. (2020)
ImageNet 256x256 | Peebles & Xie (2023)

Figure tab_6: 7
Type: table
Caption: Architecture details of our models.
Data:
Dataset | NN1 | NN2
CIFAR10 (LS) | Conv | Conv
CIFAR10 (CS) | Conv | Conv
CelebA 64x64 | Conv | Conv
ImageNet 64x64 | Conv | Res+Conv
LSUN Bedroom | Conv | Res+Conv
ImageNet 256x256 | AdaLN+Linear | AdaLN+Linear

Figure tab_7: 
Type: table
Caption: Res denotes the residual block (He et al., 2016), AdaLN denotes the adaptive layer norm block (Peebles & Xie, 2023), and Linear denotes the linear layer.
Cost of Memory and Inference Time. In Table 8, we report the number of parameters and the inference time for various models. The additional memory cost of the diagonal Hessian prediction network is minimal compared to the original diffusion models, which include only the score prediction. The extra inference cost is negligible for CIFAR10, CelebA 64x64, and ImageNet 256x256. The additional time is at most 5.3% on ImageNet 64x64 and 11.2% on LSUN Bedroom, but this is offset by the benefit of requiring fewer sampling steps to achieve an FID around 6, as shown in Table 4. Notably, although the model on ImageNet 256x256 is the largest, it has the shortest inference time because DiT learns the diffusion model in the latent space.
Data (number of parameters / inference time):
Dataset | Score prediction network | Score & SN prediction networks | Score & diagonal Hessian prediction networks
CIFAR10 (LS) | 52.54 M / 44.37 ms | 52.55 M / 44.61 ms (+0.5%) | 52.55 M / 44.51 ms (+0.3%)
CIFAR10 (CS) | 52.54 M / 45.13 ms | 52.55 M / 45.24 ms (+0.2%) | 52.55 M / 45.23 ms (+0.2%)
CelebA 64x64 | 78.70 M / 67.88 ms | 78.71 M / 69.15 ms (+1.9%) | 78.71 M / 68.89 ms (+1.5%)
ImageNet 64x64 | 121.06 M / 106.58 ms | 121.49 M / 112.53 ms (+5.6%) | 121.49 M / 112.23 ms (+5.3%)
LSUN Bedroom | 113.67 M / 692.58 ms | 114.04 M / 771.55 ms (+11.4%) | 114.04 M / 770.73 ms (+11.2%)
ImageNet 256x256 | 675.13 M / 22.84 ms | Missing model | 677.80 M / 23.01 ms (+0.7%)

Figure tab_8: 
Type: table
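Caption: Mean and standard deviation (%) of NLL and FID across datasets and sampling steps.
Data:
CIFAR10 (LS) and CIFAR10 (CS)
Metric | LS K=10 | LS K=25 | LS K=50 | LS K=100 | LS K=200 | LS K=1000 | CS K=10 | CS K=25 | CS K=50 | CS K=100 | CS K=200 | CS K=1000
Mean-DDPM (NLL) | 5.33 | 4.63 | 4.35 | 3.97 | 3.78 | 3.57 | 4.99 | 4.34 | 3.99 | 3.76 | 3.59 | 3.41
Std-DDPM (NLL) % | 0.04 | 0.03 | 0.03 | 0.04 | 0.04 | 0.01 | 0.02 | 0.01 | 0.03 | 0.03 | 0.02 | 0.01
Mean-DDPM (FID) | 25.07 | 9.30 | 5.90 | 4.37 | 3.54 | 4.00 | 14.46 | 5.49 | 4.09 | 3.83 | 3.81 | 4.22
Std-DDPM (FID) % | 9.05 | 7.54 | 3.48 | 2.77 | 4.25 | 2.30 | 10.40 | 7.74 | 2.66 | 1.32 | 5.26 | 3.08
Mean-DDIM (FID) | 10.61 | 4.31 | 3.48 | 3.26 | 3.25 | 3.75 | 16.85 | 6.70 | 4.73 | 4.33 | 4.56 | 4.59
Std-DDIM (FID) % | 5.17 | 3.55 | 3.19 | 1.87 | 3.97 | 0.99 | 13.22 | 4.23 | 2.28 | 2.94 | 1.61 | 5.30
CelebA 64x64 and ImageNet 64x64
Metric | CelebA K=10 | CelebA K=25 | CelebA K=50 | CelebA K=100 | CelebA K=200 | CelebA K=1000 | ImageNet K=25 | ImageNet K=50 | ImageNet K=100 | ImageNet K=200 | ImageNet K=400 | ImageNet K=4000
Mean-DDPM (NLL) | 4.69 | 3.86 | 3.43 | 3.13 | 2.90 | 2.66 | 4.45 | 4.15 | 3.93 | 3.79 | 3.70 | 3.59
Std-DDPM (NLL) % | 0.01 | 0.01 | 0.01 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.01 | 0.01
Mean-DDPM (FID) | 21.58 | 12.65 | 9.24 | 6.96 | 6.00 | 5.02 | 28.01 | 20.85 | 18.01 | 16.76 | 16.35 | 16.33
Std-DDPM (FID) % | 2.24 | 7.65 | 2.29 | 2.32 | 5.50 | 1.37 | 0.77 | 4.66 | 6.82 | 3.28 | 3.04 | 3.21
Mean-DDIM (FID) | 10.36 | 5.69 | 4.37 | 3.50 | 3.11 | 3.03 | 28.34 | 19.68 | 17.84 | 17.41 | 17.36 | 18.91
Std-DDIM (FID) % | 7.40 | 4.04 | 5.12 | 4.88 | 6.46 | 1.55 | 6.59 | 3.77 | 9.15 | 4.08 | 7.88 | 5.26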

Figure tab_9: 10
Type: table
Caption: Upper bound on the negative log-likelihood (bits/dim) on the ImageNet 64x64 dataset.
Data:
Method | K=25 | K=50 | K=100 | K=200 | K=400 | K=4000
I-DDPM | 18.91 | 8.46 | 5.27 | 4.24 | 3.86 | 3.57
NPR-DDPM | 4.66 | 4.22 | 3.96 | 3.80 | 3.71 | 3.60
SN-DDPM | 4.56 | 4.18 | 3.95 | 3.80 | 3.71 | 3.63
OCM-DDPM | 4.45 | 4.15 | 3.93 | 3.79 | 3.70 | 3.59

Figure tab_10: 11
Type: table
Caption: Results of OCM-DDPM on CIFAR10 (CS) with varying numbers of Rademacher samples M.
The empirical results, presented in Table 11, indicate that a larger M (e.g., M = 3) can give a small improvement in FID, likely because a larger M reduces the gradient estimation variance during training. However, in most cases, different values of M yield consistent performance. This is practically desirable, since setting M = 1 allows for efficient training while maintaining strong performance.
C.4 MEAN AND VARIANCE OF PERFORMANCE
Data:
M | FID ↓ K=10 | FID ↓ K=25 | FID ↓ K=50 | FID ↓ K=100 | NLL (%) ↑ K=10 | NLL (%) ↑ K=25 | NLL (%) ↑ K=50 | NLL (%) ↑ K=100
M = 1 | 14.32 | 5.54 | 4.10 | 3.84 | 4.99 | 4.34 | 3.99 | 3.76
M = 3 | 14.18 | 5.51 | 4.11 | 3.82 | 4.99 | 4.34 | 3.99 | 3.76
M = 5 | 14.17 | 5.51 | 4.11 | 3.82 | 4.99 | 4.34 | 3.99 | 3.76
M = 10 | 14.16 | 5.51 | 4.11 | 3.82 | 4.99 | 4.34 | 3.99 | 3.76

Figure tab_11: 13
Type: table
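Caption: Table 9 is obtained by repeating the evaluation three times with different seeds. Table 12: Results with varying CFG coefficients using 10 sampling steps on ImageNet 256x256. Table 13: Results with CFG = 1.5 across different sampling steps on ImageNet 256x256.
Data:
Table 12 (10 sampling steps)
Method | FID ↓ CFG=1.5 | FID ↓ CFG=1.75 | FID ↓ CFG=2.0 | FID ↓ CFG=3.0 | FID ↓ CFG=4.0 | Recall (%) ↑ CFG=1.5 | Recall (%) ↑ CFG=1.75 | Recall (%) ↑ CFG=2.0 | Recall (%) ↑ CFG=3.0 | Recall (%) ↑ CFG=4.0
DDPM, β̃ | 30.41 | 19.26 | 13.53 | 11.52 | 14.76 | 39.74 | 35.29 | 32.08 | 24.93 | 19.91
DDPM, β | 210.28 | 182.78 | 158.16 | 89.86 | 58.16 | 16.98 | 22.28 | 25.13 | 23.07 | 19.48
I-DDPM | 44.96 | 20.71 | 13.01 | 9.76 | 13.75 | 50.32 | 43.75 | 41.08 | 31.77 | 25.49
OCM-DDPM | 30.55 | 17.57 | 11.12 | 9.52 | 13.74 | 49.39 | 45.87 | 42.33 | 32.13 | 24.96
DDIM | 9.41 | 6.54 | 6.12 | 10.49 | 14.18 | 49.15 | 45.82 | 41.15 | 29.27 | 23.16
OCM-DDIM | 11.30 | 6.67 | 5.33 | 9.20 | 13.49 | 54.50 | 50.32 | 46.58 | 34.53 | 25.67
Table 13 (CFG = 1.5)
Method | FID ↓ K=10 | FID ↓ K=25 | FID ↓ K=50 | FID ↓ K=100 | FID ↓ K=200 | FID ↓ K=250 | Recall (%) ↑ K=10 | Recall (%) ↑ K=25 | Recall (%) ↑ K=50 | Recall (%) ↑ K=100 | Recall (%) ↑ K=200 | Recall (%) ↑ K=250
DDPM, β̃ | 30.41 | 4.79 | 3.55 | 3.09 | 3.05 | 2.97 | 39.74 | 49.02 | 51.79 | 53.54 | 54.21 | 54.34
DDPM, β | 210.28 | 42.09 | 9.43 | 3.63 | 2.91 | 2.75 | 16.98 | 46.76 | 51.67 | 55.01 | 55.50 | 55.21
I-DDPM | 44.96 | 9.01 | 3.70 | 2.48 | 2.25 | 2.74 | 50.32 | 54.78 | 56.45 | 57.88 | 58.76 | 54.69
OCM-DDPM | 30.55 | 4.96 | 3.21 | 2.71 | 2.50 | 2.75 | 49.39 | 53.75 | 54.68 | 55.85 | 54.97 | 54.75
DDIM | 9.41 | 2.70 | 2.33 | 2.25 | 2.23 | 2.20 | 49.15 | 56.41 | 57.00 | 57.83 | 57.81 | 57.99
OCM-DDIM | 11.30 | 3.26 | 2.56 | 2.28 | 2.23 | 2.18 | 54.50 | 58.33 | 58.09 | 58.80 | 58.16 | 58.56
C.5 MORE RESULTS ON LATENT DIFFUSION MODELS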


Formulas:
Formula formula_0: q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I), \quad (2)

Formula formula_1: \mu_{t-1}(x_t; \theta) = (x_t + \beta_t \nabla_{x_t} \log p_\theta(x_t)) / \sqrt{1 - \beta_t}, \quad (4)

Formula formula_2: q(x_{t-1} \mid x_t) \approx \int q(x_{t-1} \mid x_t, x_0)\, p_\theta(x_0 \mid x_t)\, dx_0, \quad (5)

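Formula formula_3: \mu_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_{t-1}^2}\, (x_t - \sqrt{\bar{\alpha}_t}\, x_0) / \sqrt{1 - \bar{\alpha}_t}. \quad (6)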

Formula formula_4: \mu_0(x_t; \theta) = (x_t + (1 - \bar{\alpha}_t) \nabla_{x_t} \log p_\theta(x_t)) / \sqrt{\bar{\alpha}_t}. \quad (7)

Formula formula_5: \Sigma(\tilde{x}) = \mathbb{E}_{q(x \mid \tilde{x})}[x^2] - \mathbb{E}_{q(x \mid \tilde{x})}[x]^2

Formula formula_6: \Sigma(\tilde{x}) = \left( \sigma^4 \nabla^2_{\tilde{x}} \log q(\tilde{x}) + \sigma^2 I \right) / \alpha^2. \quad (8)

Formula formula_8: to remove the $O(D^2)$ storage requirement: $\mathrm{diag}(H(\tilde{x})) \approx \frac{1}{M} \sum_{m=1}^{M} v_m \odot H(\tilde{x}) v_m$, (9)
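
To make the estimator in Eq. (9) concrete, here is a minimal NumPy sketch. It assumes an explicit symmetric matrix `H` purely for illustration (in the paper only Hessian-vector products are needed), and the function name `hutchinson_diag` is hypothetical, not taken from the authors' code.

```python
import numpy as np

def hutchinson_diag(H, M, rng):
    """Estimate diag(H) as (1/M) * sum_m v_m * (H @ v_m) with Rademacher v_m."""
    D = H.shape[0]
    est = np.zeros(D)
    for _ in range(M):
        v = rng.choice([-1.0, 1.0], size=D)  # Rademacher probe vector
        est += v * (H @ v)                   # uses only a matrix-vector product
    return est / M

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A + A.T  # hypothetical symmetric stand-in for a Hessian

est = hutchinson_diag(H, M=20000, rng=rng)
max_err = np.max(np.abs(est - np.diag(H)))

# For a diagonal matrix a single probe is already exact, because v_i**2 == 1.
exact = hutchinson_diag(np.diag([1.0, -2.0, 3.0]), M=1, rng=rng)
```

The off-diagonal entries of `H` contribute zero-mean noise, so the error shrinks as `M` grows; the diagonal entries are recovered without noise.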

Formula formula_9: Let $x_{t'} \leftarrow \sqrt{\bar{\alpha}_{t'}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t'}}\, \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{\sqrt{1 - \bar{\alpha}_t}}$.

Formula formula_10: \min_\phi \mathbb{E}_{q(\tilde{x})} \left\| h_\phi(\tilde{x}) - \frac{1}{M} \sum_{m=1}^{M} v_m \odot H(\tilde{x}) v_m \right\|_2^2, \quad (10)

Formula formula_11: L_{\mathrm{ocm}}(\phi) = \mathbb{E}_{q(\tilde{x}) p(v)} \left\| h_\phi(\tilde{x}) - v \odot H(\tilde{x}) v \right\|_2^2, \quad (11)

Formula formula_12: \Sigma(\tilde{x}; \phi) = (\sigma^4 h_\phi(\tilde{x}) + \sigma^2 I) / \alpha^2. \quad (12)

Formula formula_14: \min_\phi \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{q(x_t, x_0) p(v)} \left\| h_\phi(x_t) - v \odot H_t(x_t) v \right\|_2^2, \quad (13)

Formula formula_16: q(x_t \mid x_{t'}) = \mathcal{N}(x_t \mid \sqrt{\bar{\alpha}_{t':t}}\, x_{t'}, (1 - \bar{\alpha}_{t':t}) I).

Formula formula_17: \mu_{t'}(x_t; \theta) = (x_t + (1 - \bar{\alpha}_{t':t}) \nabla_{x_t} \log p_\theta(x_t)) / \sqrt{\bar{\alpha}_{t':t}}, \quad (15)

Formula formula_18: \Sigma_{t'}(x_t; \phi) = ((1 - \bar{\alpha}_{t':t})^2 h_\phi(x_t) + (1 - \bar{\alpha}_{t':t}) I) / \bar{\alpha}_{t':t}. \quad (16)

Formula formula_19: x_{t'} = \mu_{t'}(x_t; \theta) + \epsilon\, \Sigma_{t'}^{1/2}(x_t; \phi).
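
The skip-step DDPM update built from Eqs. (15) and (16) can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: `score` and `h` stand in for the outputs of the score and diagonal Hessian networks, the function names are hypothetical, and clipping the variance at zero is an added assumption. The toy check uses a one-dimensional Gaussian marginal, for which the exact posterior is known in closed form.

```python
import numpy as np

def skip_step_moments(x_t, score, h, abar_ratio):
    """Mean (Eq. 15) and diagonal covariance (Eq. 16); abar_ratio is abar_{t':t}."""
    mean = (x_t + (1.0 - abar_ratio) * score) / np.sqrt(abar_ratio)
    var = ((1.0 - abar_ratio) ** 2 * h + (1.0 - abar_ratio)) / abar_ratio
    return mean, var

def skip_step_sample(x_t, score, h, abar_ratio, rng):
    """Draw x_{t'} = mean + eps * sqrt(var) with eps ~ N(0, I)."""
    mean, var = skip_step_moments(x_t, score, h, abar_ratio)
    var = np.maximum(var, 0.0)  # assumption: guard against negative predicted variances
    return mean + rng.standard_normal(x_t.shape) * np.sqrt(var)

# Toy check: if x_{t'} ~ N(0, s2) per dimension, then q(x_t) = N(0, v).
s2, abar_ratio = 2.0, 0.6
v = abar_ratio * s2 + (1.0 - abar_ratio)
x_t = np.array([0.3, -1.2])
score = -x_t / v                  # exact score of N(0, v)
h = np.full_like(x_t, -1.0 / v)   # exact diagonal Hessian of log N(0, v)

mean, var = skip_step_moments(x_t, score, h, abar_ratio)
true_var = 1.0 / (1.0 / s2 + abar_ratio / (1.0 - abar_ratio))
true_mean = true_var * np.sqrt(abar_ratio) / (1.0 - abar_ratio) * x_t
sample = skip_step_sample(x_t, score, h, abar_ratio, np.random.default_rng(0))
```

With the exact score and Hessian, `mean` and `var` coincide with the true Gaussian posterior moments, which is the moment-matching property the method relies on.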

Formula formula_20: x_{t'} = \sqrt{\bar{\alpha}_{t'}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t'}} / \sqrt{1 - \bar{\alpha}_t} \cdot (x_t - \sqrt{\bar{\alpha}_t}\, x_0), \quad (17)
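
Eq. (17) can be read as: recover the noise implied by the pair (x_t, x_0), then re-noise x_0 to level t'. A minimal NumPy sketch under that reading follows; the function name `ddim_step` and the numeric values are illustrative assumptions.

```python
import numpy as np

def ddim_step(x_t, x0, abar_t, abar_tp):
    """Deterministic skip-step DDIM transition of Eq. (17)."""
    eps = (x_t - np.sqrt(abar_t) * x0) / np.sqrt(1.0 - abar_t)   # implied noise
    return np.sqrt(abar_tp) * x0 + np.sqrt(1.0 - abar_tp) * eps  # re-noise to t'

# Consistency check: build x_t from a known (x0, eps) pair and step to t'.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
abar_t, abar_tp = 0.3, 0.7
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_tp = ddim_step(x_t, x0, abar_t, abar_tp)
expected = np.sqrt(abar_tp) * x0 + np.sqrt(1.0 - abar_tp) * eps
```

In the sampler, `x0` would be a draw from the Gaussian approximation in Eq. (18) rather than the ground-truth clean sample used in this check.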

Formula formula_21: x_0 \sim p_{\theta,\phi}(x_0 \mid x_t) = \mathcal{N}(\mu_0(x_t; \theta), \Sigma_0(x_t; \phi)) \text{ and } \mu_0(x_t; \theta) = (x_t + (1 - \bar{\alpha}_t) \nabla_{x_t} \log p_\theta(x_t)) / \sqrt{\bar{\alpha}_t}, \quad (18)

Formula formula_22: \Sigma_0(x_t; \phi) = ((1 - \bar{\alpha}_t)^2 h_\phi(x_t, t) + (1 - \bar{\alpha}_t) I) / \bar{\alpha}_t. \quad (19)

Formula formula_23: s_\theta(x_t) = \mathrm{NN}_1(\mathrm{BaseNet}(x_t, t; \theta_1); \theta_2), \quad h_\phi(x_t) = \mathrm{NN}_2(\mathrm{BaseNet}(x_t, t; \theta_1); \phi) \quad (20)

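Formula formula_24: $x_{t'} \mid x_t)$, which is $\Sigma_{t'}(x_t) = (1 - \bar{\alpha}_t / \bar{\alpha}_{t'}) I$; when $t' = t - 1$, we have $\Sigma_{t-1}(x_t) = \beta_t$. The $\tilde{\beta}$-DDPM uses the covariance of $p(x_{t'} \mid x_0, x_t)$, which is $\frac{1 - \bar{\alpha}_{t'}}{1 - \bar{\alpha}_t} (1 - \bar{\alpha}_{t':t})$.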

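Formula formula_25: \sigma_{t'}^2 = \frac{1 - \bar{\alpha}_{t':t}}{\bar{\alpha}_{t':t}} - \frac{(1 - \bar{\alpha}_{t':t})^2}{d\, \bar{\alpha}_{t':t}} \mathbb{E}_{q(x_t)} \left\| \nabla_{x_t} \log p_\theta(x_t) \right\|_2^2,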

Formula formula_26: \Sigma_{t'}(x_t; \psi) = \exp\left( v_\psi(x_t) \log \beta_t + (1 - v_\psi(x_t)) \log \tilde{\beta}_t \right), \quad (21)

Formula formula_27: \beta_t \to 1 - \bar{\alpha}_{t':t}, \quad \tilde{\beta}_t \to \frac{1 - \bar{\alpha}_{t'}}{1 - \bar{\alpha}_t} (1 - \bar{\alpha}_{t':t}).
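
The I-DDPM parameterisation in Eq. (21), together with the skip-step substitution above, amounts to a log-space interpolation between the two fixed variances. A small NumPy sketch follows; the function name, the variable `v` (standing in for the network output), and the schedule values are illustrative assumptions.

```python
import numpy as np

def iddpm_variance(v, beta, beta_tilde):
    """Eq. (21): interpolate between beta and beta_tilde in log space."""
    return np.exp(v * np.log(beta) + (1.0 - v) * np.log(beta_tilde))

# Skip-step substitution with hypothetical schedule values.
abar_tp, abar_t = 0.7, 0.3
abar_ratio = abar_t / abar_tp                      # abar_{t':t}
beta = 1.0 - abar_ratio                            # upper choice
beta_tilde = (1.0 - abar_tp) / (1.0 - abar_t) * beta  # lower choice

upper = iddpm_variance(np.array([1.0]), beta, beta_tilde)
lower = iddpm_variance(np.array([0.0]), beta, beta_tilde)
middle = iddpm_variance(np.array([0.5]), beta, beta_tilde)
```

Setting `v = 1` recovers the β choice, `v = 0` recovers the β̃ choice, and `v = 0.5` gives their geometric mean, so the predicted variance always stays between the two fixed schedules when `v` lies in [0, 1].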

Formula formula_28: \epsilon_t = (x_t - \sqrt{\bar{\alpha}_t}\, x_0) / \sqrt{1 - \bar{\alpha}_t}: \quad \min_\psi \mathbb{E}_{t, q(x_0, x_t)} \left\| \epsilon_t^2 - g_\psi(x_t) \right\|_2^2. \quad (22)

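Formula formula_30: \Sigma_{t'}(x_t; \psi) = \frac{1 - \bar{\alpha}_{t'}}{1 - \bar{\alpha}_t} \beta_{t':t} I + \frac{\beta_t^2}{\alpha_t} \left( \frac{g_\psi(x_t)}{1 - \bar{\alpha}_t} - \left( \nabla_{x_t} \log p_\theta(x_t) \right)^2 \right). \quad (23)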

Formula formula_32: \Sigma(\tilde{x}) = \left( \sigma^4 \nabla^2_{\tilde{x}} \log q(\tilde{x}) + \sigma^2 I \right) / \alpha^2. \quad (8)

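Formula formula_34: \begin{aligned}
\nabla^2_{\tilde{x}} \log q(\tilde{x}) &= -\frac{1}{\sigma^2} \nabla_{\tilde{x}} \int (\tilde{x} - \alpha x) \frac{q(\tilde{x} \mid x) q(x)}{q(\tilde{x})} dx \\
&= -\frac{1}{\sigma^2} \int \frac{q(\tilde{x} \mid x) q(x)}{q(\tilde{x})} dx - \frac{1}{\sigma^2} \int (\tilde{x} - \alpha x) \frac{\nabla_{\tilde{x}} q(\tilde{x} \mid x)\, q(x)\, q(\tilde{x}) - \nabla_{\tilde{x}} q(\tilde{x})\, q(\tilde{x} \mid x)\, q(x)}{q^2(\tilde{x})} dx \\
\Longrightarrow \ \sigma^2 \nabla^2_{\tilde{x}} \log q(\tilde{x}) + I &= -\int (\tilde{x} - \alpha x) \frac{\nabla_{\tilde{x}} q(\tilde{x} \mid x)\, q(x) - \nabla_{\tilde{x}} \log q(\tilde{x})\, q(\tilde{x} \mid x)\, q(x)}{q(\tilde{x})} dx \\
&= -\int (\tilde{x} - \alpha x) \frac{-\frac{1}{\sigma^2} (\tilde{x} - \alpha x)\, q(\tilde{x} \mid x) q(x) + \frac{1}{\sigma^2} (\tilde{x} - \alpha \mathbb{E}_{q(x \mid \tilde{x})}[x])\, q(\tilde{x} \mid x) q(x)}{q(\tilde{x})} dx \\
\Longrightarrow \ \sigma^4 \nabla^2_{\tilde{x}} \log q(\tilde{x}) + \sigma^2 I &= \int \left[ (\tilde{x} - \alpha x)^2 - (\tilde{x} - \alpha x)(\tilde{x} - \alpha \mathbb{E}_{q(x \mid \tilde{x})}[x]) \right] q(x \mid \tilde{x})\, dx \\
&= \alpha^2 \mathbb{E}_{q(x \mid \tilde{x})}[x^2] - \alpha^2 \mathbb{E}_{q(x \mid \tilde{x})}[x]^2 \equiv \alpha^2 \Sigma(\tilde{x})
\end{aligned}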

Formula formula_35: \min_\phi \mathbb{E}_{q(\tilde{x})} \left\| h_\phi(\tilde{x}) - \mathbb{E}_{q(v)}[v \odot H(\tilde{x}) v] \right\|_2^2 \quad (24)

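Formula formula_36: \mathbb{E}_{q(\tilde{x})} \left\| h_\phi(\tilde{x}) - \mathbb{E}_{q(v)}[v \odot H(\tilde{x}) v] \right\|_2^2 = \mathbb{E}_{q(\tilde{x})} \left\| \mathbb{E}_{q(v)}[h_\phi(\tilde{x}) - v \odot H(\tilde{x}) v] \right\|_2^2 \le \mathbb{E}_{q(\tilde{x})} \mathbb{E}_{q(v)} \left\| h_\phi(\tilde{x}) - v \odot H(\tilde{x}) v \right\|_2^2 = L_{\mathrm{ocm}}(\phi).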

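Formula formula_37: \begin{aligned}
L_{\mathrm{ocm}}(\phi) &= \mathbb{E}_{p(v) q(\tilde{x})} \left\| h_\phi(\tilde{x}) - v \odot H(\tilde{x}) v \right\|_2^2 \\
&= \mathbb{E}_{q(\tilde{x})} \left\| h_\phi(\tilde{x}) \right\|_2^2 - 2\, \mathbb{E}_{p(v) q(\tilde{x})} \left[ h_\phi(\tilde{x})^T (v \odot H(\tilde{x}) v) \right] + c \\
&= \mathbb{E}_{q(\tilde{x})} \left\| h_\phi(\tilde{x}) \right\|_2^2 - 2\, \mathbb{E}_{q(\tilde{x})} \left[ h_\phi(\tilde{x})^T \mathrm{diag}(H(\tilde{x})) \right] + c \\
&= \mathbb{E}_{q(\tilde{x})} \left\| h_\phi(\tilde{x}) - \mathrm{diag}(H(\tilde{x})) \right\|_2^2 + c'
\end{aligned}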
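
The decomposition above says that, for a fixed input, the OCM loss differs from the squared distance to diag(H) only by a constant that does not depend on the prediction. The following sketch checks this numerically by enumerating all Rademacher vectors in a small dimension; the matrix `H`, the helper name `ocm_loss_exact`, and the dimension D = 3 are illustrative assumptions.

```python
import itertools
import numpy as np

def ocm_loss_exact(h, H):
    """E_v ||h - v * (H v)||^2 over all 2^D Rademacher vectors (small D only)."""
    D = H.shape[0]
    losses = []
    for signs in itertools.product([-1.0, 1.0], repeat=D):
        v = np.array(signs)
        losses.append(np.sum((h - v * (H @ v)) ** 2))
    return float(np.mean(losses))

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
H = A + A.T          # hypothetical symmetric stand-in for a Hessian
d = np.diag(H)

# Gap between the OCM loss and ||h - diag(H)||^2 for several random predictions h.
gaps = []
for _ in range(4):
    h = rng.standard_normal(3)
    gaps.append(ocm_loss_exact(h, H) - np.sum((h - d) ** 2))

loss_at_diag = ocm_loss_exact(d, H)
loss_perturbed = ocm_loss_exact(d + 0.1, H)
```

All entries of `gaps` agree up to floating-point error, so minimising the OCM loss over the prediction is equivalent to regressing diag(H) directly.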

Formula formula_38: \mathbb{E}_{q(x \mid \tilde{x})}[x] = \frac{1}{\sqrt{\alpha}} \left( \tilde{x} + \beta \nabla_{\tilde{x}} \log q(\tilde{x}) \right). \quad (25)

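Formula formula_40: \mathbb{E}_{q(x \mid \tilde{x})}[x x^T] = \frac{1}{\alpha} \left( \tilde{x} \tilde{x}^T + \beta s_1(\tilde{x}) \tilde{x}^T + \beta \tilde{x} s_1(\tilde{x})^T + \beta^2 s_2(\tilde{x}) + \beta^2 s_1(\tilde{x}) s_1(\tilde{x})^T + \beta I \right), \quad (26)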

Formula formula_42: q(\tilde{x} \mid \eta) = e^{\eta^T \tilde{x} - \psi(\eta)} q_0(\tilde{x}), \quad (27)

Formula formula_43: \eta = \frac{\sqrt{\alpha}}{\beta} x, \quad q_0(\tilde{x}) \propto e^{-\frac{1}{2\beta} \|\tilde{x}\|_2^2}

Formula formula_44: q(\eta \mid \tilde{x}) = e^{\eta^T \tilde{x} - \psi(\eta) - \lambda(\tilde{x})} p(\eta), \quad (28)

Formula formula_45: \int (\eta - \nabla_{\tilde{x}} \lambda(\tilde{x}))^T q(\eta \mid \tilde{x})\, d\eta = 0, \quad (29)

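Formula formula_46: $\int \eta\, (\eta - \nabla_{\tilde{x}} \lambda(\tilde{x}))^T q(\eta \mid \tilde{x})\, d\eta = \nabla^2_{\tilde{x}} \lambda(\tilde{x})$, (30) which implies that $\mathbb{E}[\eta \eta^T \mid \tilde{x}] = \nabla^2_{\tilde{x}} \lambda(\tilde{x}) + \nabla_{\tilde{x}} \lambda(\tilde{x}) \nabla_{\tilde{x}}^T \lambda(\tilde{x})$. By substituting $\eta = \frac{\sqrt{\alpha}}{\beta} x$, $\nabla_{\tilde{x}} \lambda(\tilde{x}) = s_1(\tilde{x}) + \frac{1}{\beta} \tilde{x}$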

Formula formula_48: \mathrm{Cov}_{q(x \mid \tilde{x})}[x] = \frac{\beta}{\alpha} \left( I + \beta \nabla^2_{\tilde{x}} \log q(\tilde{x}) \right). \quad (31)

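Formula formula_50: \begin{aligned}
\mathrm{Cov}_{q(x \mid \tilde{x})}[x] &= \frac{\beta^2}{\alpha} \mathrm{Cov}_{q(x \mid \tilde{x})}\left[ \frac{\tilde{x} - \sqrt{\alpha} x}{\beta} \right] \\
&= \frac{\beta^2}{\alpha} \left( \mathbb{E}_{q(x \mid \tilde{x})}\left[ \frac{\tilde{x} - \sqrt{\alpha} x}{\beta} \left( \frac{\tilde{x} - \sqrt{\alpha} x}{\beta} \right)^T \right] - \mathbb{E}_{q(x \mid \tilde{x})}\left[ \frac{\tilde{x} - \sqrt{\alpha} x}{\beta} \right] \mathbb{E}_{q(x \mid \tilde{x})}\left[ \frac{\tilde{x} - \sqrt{\alpha} x}{\beta} \right]^T \right) \\
&= \frac{\beta^2}{\alpha} \left( \frac{1}{\beta^2} \mathbb{E}_{q(x \mid \tilde{x})}\left[ (\tilde{x} - \sqrt{\alpha} x)(\tilde{x} - \sqrt{\alpha} x)^T \right] - s_1(\tilde{x}) s_1(\tilde{x})^T \right) \\
&= \frac{\beta^2}{\alpha} \left( \frac{1}{\beta^2} \left( \tilde{x} \tilde{x}^T - 2 \tilde{x} (\tilde{x} + \beta s_1(\tilde{x}))^T + \alpha \mathbb{E}_{q(x \mid \tilde{x})}[x x^T] \right) - s_1(\tilde{x}) s_1(\tilde{x})^T \right) \\
&= \frac{\beta^2}{\alpha} \left( \frac{1}{\beta^2} \left( \beta^2 s_2(\tilde{x}) + \beta^2 s_1(\tilde{x}) s_1(\tilde{x})^T + \beta I \right) - s_1(\tilde{x}) s_1(\tilde{x})^T \right) \\
&= \frac{\beta^2}{\alpha} \left( \frac{1}{\beta} I + s_2(\tilde{x}) \right) \equiv \frac{\beta}{\alpha} \left( I + \beta \nabla^2_{\tilde{x}} \log q(\tilde{x}) \right),
\end{aligned}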

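Formula formula_52: \mathrm{Cov}_{q(x \mid \tilde{x})}[x] = \frac{\beta^2}{\alpha} \left( \frac{1}{\beta} \mathbb{E}_{q(x \mid \tilde{x})}[\epsilon \epsilon^T] - \nabla_{\tilde{x}} \log q(\tilde{x}) \nabla_{\tilde{x}} \log q(\tilde{x})^T \right), \quad (32)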

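Formula formula_53: $\Sigma_{t-1}(x_t; \phi^*) = \frac{(1 - \alpha_t)^2 h_{\phi^*}(x_t) + (1 - \alpha_t) I}{\alpha_t}$ and that of SN-DDPM $\Sigma_{t-1}(x_t; \psi^*) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t + \frac{\beta_t^2}{\alpha_t} \left( \frac{g_{\psi^*}(x_t)}{1 - \bar{\alpha}_t} - \left( \nabla_{x_t} \log p_\theta(x_t) \right)^2 \right)$ are identical.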

Formula formula_54: h_{\phi^*}(x_t) = \mathrm{diag}(\nabla^2_{x_t} \log q(x_t)) \text{ and } g_{\psi^*}(x_t) = \mathbb{E}_{q(x_0 \mid x_t)}[\epsilon_t^2].

Formula formula_55: \mathrm{Cov}_{q(x_{t-1} \mid x_t)}[x_{t-1}] = \lambda_t^2 I + \gamma_t^2\, \mathrm{Cov}_{q(x_0 \mid x_t)}[x_0], \quad (33)

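Formula formula_56: $\lambda_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$ and $\gamma_t = \sqrt{\bar{\alpha}_{t-1}} - \sqrt{1 - \bar{\alpha}_{t-1} - \lambda_t^2} \sqrt{\frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}} = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}$. Since $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)$, applying Lemma 3 gives $\mathrm{diag}(\mathrm{Cov}_{q(x_0 \mid x_t)}[x_0]) = \frac{1 - \bar{\alpha}_t}{\bar{\alpha}_t} \left( I + (1 - \bar{\alpha}_t) h_{\phi^*}(x_t) \right)$.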

Formula formula_57: \mathrm{diag}(\mathrm{Cov}_{q(x_{t-1} \mid x_t)}[x_{t-1}]) = \frac{(1 - \alpha_t)^2 h_{\phi^*}(x_t) + (1 - \alpha_t) I}{\alpha_t} = \Sigma_{t-1}(x_t; \phi^*)

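Formula formula_58: \mathrm{diag}(\mathrm{Cov}_{q(x_0 \mid x_t)}[x_0]) = \frac{(1 - \bar{\alpha}_t)^2}{\bar{\alpha}_t} \left( \frac{g_{\psi^*}(x_t)}{1 - \bar{\alpha}_t} - \left( \nabla_{x_t} \log p_\theta(x_t) \right)^2 \right).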

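Formula formula_59: $\mathrm{diag}(\mathrm{Cov}_{q(x_{t-1} \mid x_t)}[x_{t-1}]) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t + \frac{\beta_t^2}{\alpha_t} \left( \frac{g_{\psi^*}(x_t)}{1 - \bar{\alpha}_t} - \left( \nabla_{x_t} \log p_\theta(x_t) \right)^2 \right) = \Sigma_{t-1}(x_t; \psi^*)$. Therefore, $\mathrm{diag}(\mathrm{Cov}_{q(x_{t-1} \mid x_t)}[x_{t-1}]) = \Sigma_{t-1}(x_t; \phi^*) = \Sigma_{t-1}(x_t; \psi^*)$ as desired.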
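
As a sanity check of this equivalence, the two optimal covariances can be compared numerically on a toy problem where everything is available in closed form. The sketch below assumes one-dimensional Gaussian data x_0 ~ N(0, s2) per coordinate and hypothetical schedule values; it is an illustration, not part of the paper's experiments.

```python
import numpy as np

# Hypothetical schedule values for a single step t-1 <- t.
abar_prev, alpha_t = 0.8, 0.9
abar_t = abar_prev * alpha_t
beta_t = 1.0 - alpha_t

# Toy data x_0 ~ N(0, s2), hence q(x_t) = N(0, v).
s2 = 2.0
v = abar_t * s2 + (1.0 - abar_t)
x_t = np.array([0.5, -1.0, 2.0])

score = -x_t / v                        # exact score of q(x_t)
h_star = np.full_like(x_t, -1.0 / v)    # optimal OCM target: diagonal Hessian

# Optimal SN target E[eps_t^2 | x_t] = Var(eps_t | x_t) + E[eps_t | x_t]^2.
post_var_x0 = s2 * (1.0 - abar_t) / v
g_star = abar_t / (1.0 - abar_t) * post_var_x0 + (1.0 - abar_t) * score ** 2

sigma_ocm = (beta_t ** 2 * h_star + beta_t) / alpha_t
sigma_sn = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t + beta_t ** 2 / alpha_t * (
    g_star / (1.0 - abar_t) - score ** 2
)
```

Both expressions evaluate to the same variance for every coordinate of `x_t`, as the proposition predicts; the two methods differ only in which quantity the network is trained to regress.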

Formula formula_60: s_\theta(x_t) = \nabla_x \log p_\theta(x_t) = -\frac{\epsilon_\theta(x_t)}{\sqrt{1 - \bar{\alpha}_t}}
