\section{Preliminaries}\label{sec:prelim}


\subsection{Score-based generative modeling}


Let $\qdata$ denote the data distribution, \emph{i.e.}, the distribution from which we wish to sample.
In score-based generative modeling, we define a forward process ${(\quo_t)}_{t\ge 0}$ with $\quo_0 = \qdata$, which transforms our data distribution into noise.
In this paper, we focus on the canonical choice of the Ornstein{--}Uhlenbeck (OU) process,
\begin{align}\label{eq:forward}
    \D x_t^\rightarrow
    = -x_t^\rightarrow \, \D t + \sqrt 2 \, \D B_t\,, \qquad x_0^\rightarrow \sim \qdata\,, \qquad \quo_t \deq \law(x_t^\rightarrow)\,,
\end{align}
where ${(B_t)}_{t\ge 0}$ is a standard Brownian motion in $\R^d$.
It is well known that the OU process mixes rapidly (exponentially fast) to its stationary distribution, the standard Gaussian distribution $\gamma^d$.
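As a brief illustration of this mixing (a standard computation, with the standard Gaussian vector $\xi$ introduced only for this remark), the linear SDE~\eqref{eq:forward} can be solved explicitly: conditionally on $x_0^\rightarrow$,
\begin{align*}
    x_t^\rightarrow
    = \exp(-t)\,x_0^\rightarrow + \sqrt{1-\exp(-2t)}\,\xi \quad \text{in law}\,, \qquad \xi \sim \gamma^d \text{ independent of } x_0^\rightarrow\,.
\end{align*}
Thus the influence of the initial point decays like $\exp(-t)$, while the variance of the added noise, $1-\exp(-2t)$ per coordinate, converges to $1$.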

Once we fix a time horizon $T > 0$, the time reversal of the SDE defined in~\eqref{eq:forward} over $[0,T]$ is given by
\begin{align}\label{eq:reverse_sde}
    \D x_t^\leftarrow
    &= (x_t^\leftarrow + 2\,\nabla \ln q_t^\leftarrow(x_t^\leftarrow)) \, \D t + \sqrt 2 \, \D B_t\,,
\end{align}
where $q_t^\leftarrow \deq q_{T-t}^\rightarrow$.
The reverse SDE is a generative model: when initialized at $x_0^\leftarrow \sim q_0^\leftarrow$, it satisfies $x_T^\leftarrow \sim \qdata$.
Since $q_0^\leftarrow = q_T^\rightarrow \approx \gamma^d$, the reverse SDE transforms samples from $\gamma^d$ (i.e., pure noise) into approximate samples from $\qdata$.
In order to implement the reverse SDE, however, one needs to estimate the score functions $\nabla \ln q_t^\leftarrow$ for $t\in [0,T]$ using the technique of score matching~\cite{hyv2005scorematching, vin2011scorematching}.
In practice, the score estimates are produced via a deep neural network, and our main assumption is that these score estimates are accurate in an $L^2$ sense (see Assumption~\ref{ass:score_error}).
This gives rise to the denoising diffusion probabilistic modeling (DDPM) algorithm.

\paragraph*{Notation.} Since the reverse process is the primary object of interest, we drop the arrow $\leftarrow$ from the notation for simplicity; thus, $q_t \deq q_t^\leftarrow$.
We will always denote the forward process with the arrow $\rightarrow$.

For each $t \in [0,T]$, let $s_t$ denote the estimate for the score $\nabla \ln q_t = \nabla \ln q_t^\leftarrow$.


\subsection{Probability flow ODE (predictor steps)}


Instead of running the reverse SDE~\eqref{eq:reverse_sde}, there is in fact an alternative process ${(x_t)}_{t\in [0,T]}$ which evolves according to an ODE (and hence evolves deterministically), and yet has the same marginals as~\eqref{eq:reverse_sde}.
This alternative process, called the \emph{probability flow ODE}, can also be used for generative modeling.

One particularly illuminating way of deriving the probability flow ODE is to invoke the celebrated theorem, due to~\cite{jko}, that the OU process is the Wasserstein gradient flow of the KL divergence functional (i.e., relative entropy) $\KL(\cdot \mmid \gamma^d)$.
From the general theory of Wasserstein gradient flows (see~\cite{ags, san15ot}), the Wasserstein gradient flow ${(\mu_t)}_{t\ge 0}$ of a functional $\cF$ can be implemented via the dynamics
\begin{align*}
    \dot z_t = -[\nabla_{W_2} \cF(\mu_t)](z_t)\,, \qquad z_0 \sim \mu_0\,,
\end{align*}
in that $z_t\sim \mu_t$ for all $t\ge 0$. Applying this to $\cF \deq \KL(\cdot \mmid \gamma^d)$, we arrive at the forward process
\begin{align}\label{eq:forward_ode}
    \dot x_t^\rightarrow
    &= -\nabla \ln\Bigl(\frac{q_t^\rightarrow}{\gamma^d}\Bigr)(x_t^\rightarrow)
    = -x_t^\rightarrow -\nabla \ln q_t^\rightarrow(x_t^\rightarrow)\,.
\end{align}
Setting $x_t \deq x^\rightarrow_{T-t}$, it is easily seen that the time reversal of~\eqref{eq:forward_ode} is
\begin{align}\label{eq:prob_flow_ode}
    \dot x_t
    = x_t + \nabla \ln q_t(x_t)\,, \quad \textit{i.e.,}\quad \dot x_t
    = x_t + \nabla \ln q_{T-t}^\rightarrow(x_t)\,,
\end{align}
which is called the probability flow ODE\@.
In this paper, the interpretation of the probability flow ODE as a reverse Wasserstein gradient flow serves only to provide intuition; the reader who is unfamiliar with Wasserstein calculus can take~\eqref{eq:prob_flow_ode} to be the definition of the probability flow ODE\@.
Crucially, it has the property that if $x_0 \sim q_0$, then $x_t \sim q_t$ for all $t\in [0,T]$.
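
For the reader who prefers a direct verification of this property, here is a sketch, assuming enough regularity to justify the manipulations; the velocity field $b_t$ is notation introduced only for this remark.
The marginals of the OU process~\eqref{eq:forward} solve the Fokker{--}Planck equation, which can be rewritten as a continuity equation using $\Delta q = \nabla \cdot (q \, \nabla \ln q)$:
\begin{align*}
    \partial_t q_t^\rightarrow
    = \nabla \cdot (q_t^\rightarrow \, x) + \Delta q_t^\rightarrow
    = -\nabla \cdot (q_t^\rightarrow \, b_t)\,, \qquad b_t(x) \deq -x - \nabla \ln q_t^\rightarrow(x)\,.
\end{align*}
This is the continuity equation associated with the ODE $\dot x_t^\rightarrow = b_t(x_t^\rightarrow)$, which is precisely~\eqref{eq:forward_ode}; hence the ODE and the SDE share the same marginals $q_t^\rightarrow$, and reversing time yields the stated property of~\eqref{eq:prob_flow_ode}.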

We can discretize the ODE~\eqref{eq:prob_flow_ode}. Fixing a step size $h > 0$, replacing the score function $\nabla \ln q_t$ with the estimated score given by $s_t$, and applying the exponential integrator to the ODE (i.e., exactly integrating the linear part), we arrive at the discretized process
\begin{align}\label{eq:prob_ode_discrete}
    x_{t+h}
    &= x_{t} + \int_0^h x_{t+u} \, \D u + h\,s_t(x_t)
    = \exp(h) \,x_t + (\exp(h)-1)\,s_t(x_t)\,.
\end{align}
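To spell out the second equality (with the auxiliary curve $y_u$ introduced only for this remark): once the score term is frozen at $s_t(x_t)$, the discretized process on $[t,t+h]$ solves the linear ODE $\dot y_u = y_u + s_t(x_t)$ with $y_0 = x_t$, and variation of constants gives
\begin{align*}
    y_u
    = \exp(u)\,x_t + \int_0^u \exp(u-r) \, \D r \; s_t(x_t)
    = \exp(u)\,x_t + (\exp(u)-1)\,s_t(x_t)\,, \qquad u \in [0,h]\,.
\end{align*}
Taking $u = h$ recovers~\eqref{eq:prob_ode_discrete}. Since $\exp(h) = 1 + h + O(h^2)$, this agrees with the Euler step $x_t + h\,(x_t + s_t(x_t))$ up to terms of order $h^2$.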


\subsection{Corrector steps}\label{sec:diffusion}


Let $q$ be a distribution over $\R^d$, and write $U$ as a shorthand for the potential $-\ln q$.


\paragraph{Overdamped Langevin.}
The \emph{overdamped Langevin diffusion} with potential $U$ is a stochastic process $(x_t)_{t\ge 0}$ over $\R^d$ given by
\begin{equation}
    \D x_t = -\nabla U(x_t) \, \D t + \sqrt{2}\,\D B_t\,.
\end{equation}
The stationary distribution of this diffusion is $q \propto \exp(-U)$.

We also consider the following discretized process where $-\nabla U$ is replaced by a \emph{score estimate} $s$. Fix a step size $h > 0$ and let $(\wh{x}_t)_{t\ge 0}$ over $\R^d$ be given by
\begin{equation}
    \D \wh{x}_t = s(\wh{x}_{\lfloor t/h\rfloor \, h})\, \D t + \sqrt{2}\, \D B_t\,.
\end{equation}
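Since the drift is constant on each interval $[kh, (k+1)h]$, integrating over one step shows that the iterates at the grid points follow the familiar recursion (with $\xi_k$ denoting i.i.d.\ standard Gaussian vectors, notation introduced only for this remark)
\begin{align*}
    \wh{x}_{(k+1)h}
    = \wh{x}_{kh} + h\,s(\wh{x}_{kh}) + \sqrt{2h}\,\xi_k\,, \qquad \xi_k \sim \gamma^d\,,
\end{align*}
because the Brownian increment $B_{(k+1)h} - B_{kh}$ is a centered Gaussian with covariance $h$ times the identity.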

\paragraph{Underdamped Langevin.} 
Given a friction parameter $\fric > 0$, the corresponding \emph{underdamped Langevin diffusion} is a stochastic process $(z_t,v_t)_{t\ge 0}$ over $\R^d\times \R^d$ given by
\begin{align}
     \D z_t &= v_t \,\D t\,, \\
     \D v_t &= -(\nabla U(z_t) + \fric v_t)\,\D t + \sqrt{2\fric}\, \D B_t  \,. \label{eq:nudef}
\end{align}
The stationary distribution of this diffusion is $q \otimes \gamma^d$.

We also consider the following discretized process, where $-\nabla U$ is replaced by a score estimate $s$. With a step size $h > 0$ as before, let $(\wh{z}_t,\wh{v}_t)_{t\ge 0}$ over $\R^d\times \R^d$ be given by
\begin{align}\label{eq:discrete_underdamped}
\begin{aligned}
    \D \wh{z}_t &= \wh{v}_t \,\D t\,,\\
    \D \wh{v}_t &= (s(\wh{z}_{\lfloor t/h\rfloor\,h}) -\fric \wh{v}_t)\,\D t + \sqrt{2\fric}\, \D B_t\,.
    \end{aligned}
\end{align}
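As an illustration of why this discretization is convenient, note that on the first step $t \in [0,h]$ the score term is frozen at $s(\wh z_0)$, so the system is linear and can be integrated in closed form; the following is a sketch obtained by variation of constants:
\begin{align*}
    \wh v_t
    &= \exp(-\fric t)\,\wh v_0 + \frac{1-\exp(-\fric t)}{\fric}\,s(\wh z_0) + \sqrt{2\fric} \int_0^t \exp(-\fric\,(t-r)) \, \D B_r\,, \\
    \wh z_t
    &= \wh z_0 + \frac{1-\exp(-\fric t)}{\fric}\,\wh v_0 + \frac{1}{\fric}\,\Bigl(t - \frac{1-\exp(-\fric t)}{\fric}\Bigr)\,s(\wh z_0) + \sqrt{2\fric} \int_0^t \frac{1-\exp(-\fric\,(t-r))}{\fric} \, \D B_r\,.
\end{align*}
Conditionally on $(\wh z_0, \wh v_0)$, the two stochastic integrals are jointly Gaussian, so each step of~\eqref{eq:discrete_underdamped} can be sampled exactly, and subsequent steps are handled in the same way.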

\paragraph{Diffusions as corrector steps.}
At time $t$, the law of the ideal reverse process~\eqref{eq:prob_flow_ode} initialized at $q_0$ is $q_t$. However, errors accumulate over the course of the algorithm: the error from initializing at $\gamma^d$ rather than at $q_0$; errors arising from discretization of~\eqref{eq:prob_flow_ode}; and errors in estimating the score function. Consequently, the law of the algorithm's iterate will not be exactly $q_t$.
We propose to use either the overdamped or the underdamped Langevin diffusion with stationary distribution $q_t$ and estimated score as a corrector, in order to bring the law of the algorithm iterate closer to $q_t$.
In the case of the underdamped Langevin diffusion, this is done by drawing an independent Gaussian random variable $\wh v_0 \sim \gamma^d$, running the system~\eqref{eq:discrete_underdamped} starting from $(\wh z_0, \wh v_0)$ (where $\wh z_0$ is the current algorithm iterate) for some time $t$, and then keeping $\wh z_t$.
In our theoretical analysis, the use of corrector steps boosts the accuracy and efficiency of the SGM\@.