\section{Introduction}


Score-based generative models (SGMs)~\cite{sohletal2015nonequilibrium, sonerm2019estimatinggradients, ho2020denoising, dhanic2021diffusionbeatsgans, songetal2021mlescorebased, songetal2021scorebased, vahkrekau2021scorebased} are a class of generative models which includes prominent image generation systems such as DALL$\cdot$E~2~\cite{ramesh2022hierarchical}.
Their startling empirical success at data generation across a range of application domains has made them a central focus of study in the literature on deep learning~\cite{Ausetal21ddmdiscrete, dhanic2021diffusionbeatsgans, kingetal2021variationaldiffusion, Shietal21gradfields, ChuSimYe22ccdf, Gnaetal22sgmmolecule, Rometal22latentdiff, Songetal22inversesgm, BofVan23probflowfp, WanHunZho23policyclass}.
In this paper, we aim to provide theoretical grounding for such models and thereby elucidate the mechanisms driving their remarkable performance.

Our work follows in the wake of numerous recent works which have provided convergence guarantees for denoising diffusion probabilistic models (DDPMs)~\cite{debetal2021scorebased, BloMroRak22genmodel, DeB22diffusion, leelutan22sgmpoly, liu2022let, Pid22sgm, WibYan22sgm, Chenetal23diffmodels, CheLeeLiu23ImprovedSGM, leelutan23sgmgeneral} and denoising diffusion implicit models (DDIMs)~\cite{CheDarDim23ddim}.
We briefly recall that the generating process for SGMs is the time reversal of a certain diffusion process, and that DDPMs hinge upon implementing the reverse diffusion process as a stochastic differential equation (SDE) whose coefficients are learned via neural network training and the statistical technique of score matching~\cite{hyv2005scorematching, vin2011scorematching} (more detailed background is provided in \S\ref{sec:prelim}).
Among these prior works, the concurrent results of~\cite{Chenetal23diffmodels, leelutan23sgmgeneral} are remarkable because they require minimal assumptions on the data distribution (in particular, they do not assume log-concavity or similarly restrictive conditions) and they hold when the errors incurred during score matching are only bounded in an $L^2$ sense, which is both natural in view of the derivation of score matching (see~\cite{hyv2005scorematching, vin2011scorematching}) and far more realistic than a uniform bound.\footnote{It is unreasonable, for instance, to assume that the score errors are bounded in an $L^\infty$ sense,
since we cannot hope to learn the score function in regions of the state space which are not well covered by the training data.}
Subsequently, the work of~\cite{CheLeeLiu23ImprovedSGM} significantly sharpened the analysis in the case when no smoothness assumptions are imposed on the data distribution. 
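
For orientation, the objects discussed above can be written down in one common normalization; the symbols $q_t$, $s_t$, $T$, and $\varepsilon_{\rm sc}$ used here are illustrative notation and are an assumption of this sketch, the precise setup being the one given in \S\ref{sec:prelim}. Take the forward process to be the Ornstein--Uhlenbeck process started at the data distribution, and write $q_t$ for the law of $X_t$:
\[
  \mathrm{d} X_t = -X_t \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t\,.
\]
Its time reversal over a horizon $T$ then solves, for a new Brownian motion again denoted $B_t$,
\[
  \mathrm{d} Y_t = \bigl(Y_t + 2 \nabla \ln q_{T-t}(Y_t)\bigr) \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t\,, \qquad Y_0 \sim q_T\,,
\]
and an $L^2$ bound on the error of a score estimate $s_t$ takes the form
\[
  \mathbb{E}_{X \sim q_t}\bigl[\|s_t(X) - \nabla \ln q_t(X)\|^2\bigr] \le \varepsilon_{\rm sc}^2\,.
\]
An $L^\infty$ bound would instead require $\sup_x \|s_t(x) - \nabla \ln q_t(x)\| \le \varepsilon_{\rm sc}$, which also constrains the estimate in regions where $q_t$ places almost no mass.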

Taken together, these works paint an encouraging picture of our understanding of DDPMs, one which accounts for both the diversity of data in applications (including data distributions which are highly multimodal or supported on lower-dimensional manifolds) and the non-convex training process, which is not guaranteed to learn the score function accurately uniformly in space.

Instead of implementing the time-reversed diffusion as an SDE, as DDPMs do, it is also possible to implement it as an ordinary differential equation (ODE), called the \emph{probability flow ODE}~\cite{songetal2021scorebased}; see \S\ref{sec:prelim}. The ODE implementation is often claimed to be faster than the SDE implementation~\cite{Luetal22dpmsolver, ZhaChe23expint}, the rationale being that ODE discretization is typically more accurate than SDE discretization, so that one can use a larger step size.
Indeed, the discretization error usually depends on the regularity of the trajectories, which is $\mc C^1$ for ODEs but only $\mc C^{\frac{1}{2}-}$ for SDEs (\emph{i.e.}, H\"older continuous with any exponent less than $\frac{1}{2}$) due to the roughness of the Brownian motion driving the evolution.
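
To see where the ODE comes from, here is a short sketch in one common normalization; the forward process $\mathrm{d} X_t = -X_t \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t$, its marginal laws $q_t$, and the horizon $T$ are notation assumed for this illustration, and the conventions actually used in this paper are those of \S\ref{sec:prelim}. The Fokker--Planck equation for the forward marginals can be rewritten as a continuity equation,
\[
  \partial_t q_t = \nabla \cdot (x \, q_t) + \Delta q_t = \nabla \cdot \bigl(q_t \, (x + \nabla \ln q_t)\bigr)\,,
\]
so the marginals are transported by the deterministic velocity field $-(x + \nabla \ln q_t(x))$. Reversing time gives the probability flow ODE
\[
  \dot y_t = y_t + \nabla \ln q_{T-t}(y_t)\,, \qquad y_0 \sim q_T\,,
\]
whose marginal law at time $t$ is $q_{T-t}$. This is the same marginal as for the reverse SDE $\mathrm{d} Y_t = (Y_t + 2 \nabla \ln q_{T-t}(Y_t)) \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t$; the ODE halves the coefficient of the score and has no Brownian term, which is why its trajectories are differentiable in time.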

Far from being able to capture this intuition, current analyses of SGMs cannot even provide a \emph{polynomial-time} analysis of the probability flow ODE\@.
The key issue is that under our minimal assumptions (\emph{i.e.}, without log-concavity of the data distribution), the underlying dynamics of either the ODE or SDE implementation are not contractive, and hence small errors quickly accumulate and are magnified.
The aforementioned analyses of DDPMs managed to overcome this challenge by leveraging techniques specific to the analysis of SDEs, through which we now understand that \emph{stochasticity} plays an important role in alleviating error accumulation.
It is unknown, however, how to carry out the analysis for the purely deterministic dynamics inherent to the probability flow ODE\@.

Our first main contribution is to give the first convergence guarantees for SGMs \edit{with OU forward dynamics} in which steps of the discretized probability flow ODE (referred to as \emph{predictor steps}) are interleaved with \emph{corrector steps} which run the overdamped Langevin diffusion with the estimated score, as pioneered in~\cite{songetal2021scorebased}.
Our results are akin to prior works on DDPMs in that they hold under minimal assumptions on the data distribution and under $L^2$ bounds on the score estimation error, and our guarantees scale polynomially in all relevant problem parameters.
Here, the corrector steps inject stochasticity which is crucial for our proofs; however, we emphasize that the use of corrector steps does \emph{not} simply reduce the problem to applying existing DDPM analyses.
Instead, we must develop an entirely new framework based on Wasserstein--to--TV regularization, which is of independent interest; see \S\ref{sec:pf_overview} for a detailed overview of our techniques.
Our results naturally raise the question of whether the corrector steps are necessary in practice, and we discuss this further in \S\ref{sec:conclusion}.
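
As a rough sketch of what a corrector step does (the symbols $q$ and $s$ are illustrative notation assumed here; the precise scheme is Algorithm~\ref{alg:over}), recall that the overdamped Langevin diffusion targeting a density $q$ is
\[
  \mathrm{d} Z_t = \nabla \ln q(Z_t) \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t\,,
\]
which leaves $q$ invariant. In a corrector step, $q$ plays the role of the current marginal law of the reverse process, and the unknown score $\nabla \ln q$ is replaced by its estimate $s$:
\[
  \mathrm{d} Z_t = s(Z_t) \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t\,.
\]
Running these dynamics for a short time does not move the target marginal, but it reintroduces the Brownian noise that the predictor steps lack.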

When the data distribution is log-smooth, the iteration complexities of prior results on DDPMs and of our first result for the probability flow ODE with overdamped corrector both scale as $O(d)$ in the dimension. Does this contradict the intuition that ODE discretization is more accurate than SDE discretization?
The answer is \emph{no}; upon inspecting our proof, we see that the discretization error of the probability flow ODE is indeed smaller than what is incurred by DDPMs, and in fact allows for a larger step size of order $1/\sqrt d$.
The bottleneck in our result stems from the use of the overdamped Langevin diffusion for the corrector steps.
Taking inspiration from the literature on log-concave sampling (see, \emph{e.g.},~\cite{chewisamplingbook} for an exposition), our second main contribution is to propose corrector steps based on the \emph{underdamped} Langevin diffusion (see \S\ref{sec:prelim}), which is known to improve the dimension dependence of sampling.
In particular, we show that the probability flow ODE with underdamped Langevin corrector attains $O(\sqrt d)$ dimension dependence.
This dependence is better than what was obtained for DDPMs in~\cite{Chenetal23diffmodels, CheLeeLiu23ImprovedSGM, leelutan23sgmgeneral} and therefore highlights the potential benefits of the ODE framework.
\edit{We note that the benefit to which we refer is at \emph{generation time}, and not at training time.}
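
For reference, a standard form of the underdamped Langevin diffusion targeting a density $q$ on $\mathbb{R}^d$, with friction parameter $\gamma > 0$, is the following; the symbols $q$, $\gamma$, $Z_t$, and $V_t$ are illustrative notation assumed for this sketch, and the conventions used in this paper are those of \S\ref{sec:prelim}:
\[
  \mathrm{d} Z_t = V_t \, \mathrm{d} t\,, \qquad
  \mathrm{d} V_t = \bigl(\nabla \ln q(Z_t) - \gamma V_t\bigr) \, \mathrm{d} t + \sqrt{2\gamma} \, \mathrm{d} B_t\,.
\]
Its stationary distribution is the product of $q$ in the position variable and a standard Gaussian in the velocity variable. Since the noise enters only through the velocity, the position $Z_t$ has continuously differentiable trajectories, in contrast with the overdamped diffusion $\mathrm{d} Z_t = \nabla \ln q(Z_t) \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t$.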

Previously,~\cite{jain2022journey} proposed a ``noise--denoise'' sampler using the underdamped Langevin diffusion, but to our knowledge, our work is the first to use it in conjunction with the probability flow ODE\@.
Although we provide preliminary numerical experiments in the Appendix, we leave it as a question for future work to determine whether the theoretical benefits of the underdamped Langevin corrector are also borne out in practice.


\subsection{Our contributions}


In summary, our contributions are the following.
\begin{itemize}
\item We provide the first convergence guarantees for the probability flow ODE with overdamped Langevin corrector (\DPOM{}; Algorithm~\ref{alg:over}).
\item We propose an algorithm based on the probability flow ODE with underdamped Langevin corrector (\DPUM{}; Algorithm~\ref{alg:under}).
\item We provide the first convergence guarantees for {\DPUM}. These convergence guarantees show improvement over (i) the complexity of {\DPOM} ($O(\sqrt{d})$ vs.\ $O(d)$) and (ii) the complexity of DDPMs, \textit{i.e.}, SDE implementations of score-based generative models (again, $O(\sqrt{d})$ vs.\ $O(d)$).
\item We provide preliminary numerical experiments in a toy example showing that {\DPUM} can sample from a highly non-log-concave distribution (see Appendix). \edit{The numerical experiments are not among our main contributions and are provided for illustration only. The Python code can be found in the supplementary material.}
\end{itemize}



Our main theorem can be summarized informally as follows; see \S\ref{sec:results} for more detailed statements.

\begin{thm}[Informal]
    Assume that the score function along the forward process is $L$-Lipschitz, and that the data distribution has finite second moment.
    Assume that we have access to score estimates whose $L^2$ error is at most $\widetilde O(\varepsilon/\sqrt L)$.
    Then, the probability flow ODE implementation of the reversed Ornstein{--}Uhlenbeck process, when interspersed with either the overdamped Langevin corrector (\DPOM{}; Algorithm~\ref{alg:over}) or with the underdamped Langevin corrector (\DPUM{}; Algorithm~\ref{alg:under}), outputs a sample whose law is $\varepsilon$-close in total variation distance to the data distribution, using $\widetilde O(L^3 d/\varepsilon^2)$ or $\widetilde O(L^2 \sqrt d/\varepsilon)$ iterations respectively.
\end{thm}
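
To make the comparison between the two bounds concrete, ignore logarithmic factors and constants; the ratio of the two iteration counts is then
\[
  \frac{L^3 d/\varepsilon^2}{L^2 \sqrt d/\varepsilon} = \frac{L \sqrt d}{\varepsilon}\,.
\]
For instance, with the purely illustrative values $L = 1$, $d = 10^4$, and $\varepsilon = 10^{-1}$ (chosen here only for the arithmetic, not taken from any experiment), the two bounds read $10^6$ and $10^3$ respectively. These are upper bounds on iteration counts and not measured running times.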

Our result provides the \emph{first} polynomial-time guarantees for the probability flow ODE implementation of SGMs, so long as it is combined with the use of corrector steps. Moreover, when the corrector steps are based on the underdamped Langevin diffusion, the dimension dependence of our result is significantly smaller ($O(\sqrt d)$ vs.\ $O(d)$) than that of prior works on the complexity of DDPMs, which provides justification for the use of ODE rather than SDE discretization in practice.

Our main assumption on the data is that the score functions along the forward process are Lipschitz continuous, which allows for highly non-log-concave distributions, yet does not cover non-smooth distributions such as distributions supported on lower-dimensional manifolds.
However, as shown in~\cite{Chenetal23diffmodels, CheLeeLiu23ImprovedSGM, leelutan23sgmgeneral}, we can also obtain polynomial-time guarantees without this smoothness assumption via early stopping (see Remark~\ref{rmk:wo_lip_score}).


\subsection{Related works}


The idea of using a time-reversed diffusion for sampling has been fruitfully exploited in the log-concave sampling literature via the \emph{proximal sampler}~\cite{titpap18auxiliary, leeshentian2021rgo, CheEld22localization, liachen22nonsmooth, fanyuanchen23improvedproximal, liache23prox}, as put forth in~\cite{chenetal2022proximalsampler}, as well as through algorithmic stochastic localization~\cite{ElAMonSel22samplingsk, MonWu23spikeddiffusion}.
\edit{Although we do not aim to be comprehensive in our discussion of the literature, we mention, \emph{e.g.},~\cite{AlbBofVan23StochInterp, Chenetal23SchrodBridge} for alternative approaches to diffusion models.}
We also note that the recent work of~\cite{CheDarDim23ddim} obtained a discretization analysis for the probability flow ODE (without corrector) in KL divergence, though their bounds have a large dependence on $d$ and are exponential in the Lipschitz constant of the score integrated over time.

\edit{Since the original arXiv submission of this paper, there have been further works studying the probability flow ODE. The work of~\cite{BenDelDou23FlowMatch} also studied the probability flow ODE, but without providing discretization guarantees (and with possibly exponential dependencies). The work of~\cite{Lietal23DiffusionModels} provides polynomial-time guarantees for the probability flow ODE (without corrector steps), at the cost of larger polynomial dependencies and more stringent score assumptions (namely, bounds on the Jacobian of the score). Also,~\cite{PedMaaMon23DiffPredCor} study another variant of the predictor--corrector framework.}