\section{Proof overview}\label{sec:pf_overview}


Here we give a detailed technical overview of the proof of our main results, Theorems~\ref{thm:pc_over} and~\ref{thm:pc_under}. As in~\cite{Chenetal23diffmodels, CheLeeLiu23ImprovedSGM, leelutan23sgmgeneral}, the three sources of error that we need to keep track of are (1) estimation of the score function; (2) discretization of time when implementing the probability flow ODE and corrector steps; and (3) initialization of the algorithm at $\gamma^d$ instead of the true law at the end of the forward process, $q_0 = q^\rightarrow_T$. It turns out that (1) is not difficult to manage once we can control (2) and (3). Furthermore, as in prior work, we can easily control (3) via the data-processing inequality: the total variation distance between the output of the algorithm initialized at $q_0$ versus at $\gamma^d$ is at most $\TV(q^\rightarrow_T,\gamma^d)$,
which is exponentially small in $T$ by rapid mixing of the OU process. So henceforth in this overview, let us assume that both the algorithm and the true process are initialized at $q_0$. It remains to control (2). 
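
For concreteness, the data-processing step can be sketched as follows. Here $\mathsf{A}$ is a placeholder symbol, introduced only for this illustration, for the Markov kernel that maps an initialization to the output of the algorithm. Since the total variation distance cannot increase when the same Markov kernel is applied to both arguments,
\[
    \TV(\gamma^d \mathsf{A},\, q^\rightarrow_T \mathsf{A}) \le \TV(\gamma^d,\, q^\rightarrow_T)\,,
\]
and by the triangle inequality the error due to (3) adds at most $\TV(q^\rightarrow_T,\gamma^d)$ to the error of the algorithm initialized at $q_0$.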

\paragraph{Failure of existing approaches.} In the SDE implementation of diffusion models, prior works handled (2) by directly bounding a strictly larger quantity, namely the KL divergence between the laws of the \emph{trajectories} of the algorithm and the true process; by Girsanov's theorem, this has a clean formulation as an integrated difference of drifts. Unfortunately, in the ODE implementation, this KL divergence is infinite: in the absence of stochasticity in the reverse process, these laws over trajectories are not even absolutely continuous with respect to each other.

In search of an alternative approach, one might try a Wasserstein analysis. 
As a first attempt, we could couple the initialization of both processes and look at how the distance between them changes over time. If $(\wh{x}_t)_{0\le t \le T}$ and $(x_t)_{0\le t \le T}$ denote the algorithm and true process, then smoothness of the score function allows us to na\"{\i}vely bound $\partial_t \E[\norm{\wh{x}_t - x_t}^2]$ by $O(L)\E[\norm{\wh{x}_t - x_t}^2]$. While this ensures that the processes are close if run for time $\ll 1/L$, it does not rule out the possibility that they drift apart exponentially quickly after time $1/L$.
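
To make the compounding explicit, here is a sketch under simplifying assumptions. Suppose that, for an absolute constant $C > 0$ and an error level $\varepsilon^2$ (both are placeholders for illustration, not quantities from our theorems), the coupled processes satisfy
\[
    \partial_t \E[\norm{\wh{x}_t - x_t}^2] \le CL\, \E[\norm{\wh{x}_t - x_t}^2] + \varepsilon^2\,, \qquad \wh{x}_0 = x_0\,.
\]
Gr\"onwall's inequality then gives
\[
    \E[\norm{\wh{x}_t - x_t}^2] \le \frac{e^{CLt} - 1}{CL}\, \varepsilon^2\,.
\]
For $t \le 1/(CL)$, the right-hand side is at most $2\varepsilon^2 t$, which is linear in $t$; for $t \gg 1/L$, it grows like $e^{CLt}$.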

\paragraph{Restarting the coupling---first attempt.} What we would like is some way of ``restarting'' this coupling before the processes drift too far apart, to avoid this exponential compounding. We now motivate how to achieve this by giving an argument that is incorrect but nevertheless captures the intuition for our approach.
Namely, let $p_t \deq \law(\wh x_t)$ denote the law of the algorithm, let $\Pode^{t_0,h}$ denote the result of running the ideal probability flow ODE for time $h$ starting from time $t_0$, and let $\Podes^{t_0,h}$ denote the same but for the discretized probability flow ODE with estimated score.
For $h\lesssim 1/L$, consider the laws of the two processes at time $2h$, i.e.,
\begin{equation}
    p_{2h} = q_0 \Podesth{0}{2h} \qquad \text{and} \qquad q_{2h} = q_0 \Podeth{0}{2h}\,. \label{eq:twosteps_overview} 
\end{equation}
The discussion above implies that $q_0 \Podeth{0}{h}$ and $q_0 \Podesth{0}{h}$ are close in 2-Wasserstein distance, so by the data-processing inequality, $q_0 \Podeth{0}{h}\Podesth{h}{h}$ and $q_0\Podesth{0}{h}\Podesth{h}{h}$ are also close. To show that $p_{2h}$ and $q_{2h}$ in Eq.~\eqref{eq:twosteps_overview} are close, it thus suffices to show that $q_0 \Podeth{0}{2h}$ and $q_0\Podeth{0}{h}\Podesth{h}{h}$ are close. But these two distributions are obtained by running the true process and the algorithm, respectively, for time $h$, both starting from $q_0 \Podeth{0}{h}$. So if we ``restart'' the coupling by coupling the processes based on their locations at time $h$, rather than time $0$, of the reverse process, we can again apply the na\"{\i}ve Wasserstein analysis.

At this juncture, it would seem that we have miraculously sidestepped the exponential blowup and shown that the expected distance between the processes only increases linearly over time! The issue, of course, is in the application of the ``data-processing inequality,'' which simply does not hold for the Wasserstein distance.
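
In symbols, writing $W_2$ for the 2-Wasserstein distance and assuming that time $h$ lies on the discretization grid so that $p_{2h} = q_0\Podesth{0}{h}\Podesth{h}{h}$, the heuristic argument rests on the triangle inequality
\begin{align*}
    W_2(p_{2h}, q_{2h})
    &\le W_2\bigl(q_0\Podesth{0}{h}\Podesth{h}{h},\; q_0\Podeth{0}{h}\Podesth{h}{h}\bigr) \\
    &\qquad + W_2\bigl(q_0\Podeth{0}{h}\Podesth{h}{h},\; q_0\Podeth{0}{2h}\bigr)\,.
\end{align*}
The second term is handled by the restarted coupling, whereas the first term is where the invalid step enters. A one-dimensional toy example, given purely for illustration, shows why no such inequality can hold in general: for the deterministic map $T(x) = cx$ with $c > 1$ and point masses $\delta_a$, $\delta_b$ with $a \ne b$,
\[
    W_2(T_\# \delta_a, T_\# \delta_b) = c\, |a - b| > |a - b| = W_2(\delta_a, \delta_b)\,.
\]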

\paragraph{Restarting the coupling with a corrector step.} This is where the corrector comes in. The idea is to use \emph{short-time regularization}: if we apply a small amount of noise to two distributions which are already close in Wasserstein distance, then they become close in KL divergence, for which a data-processing inequality holds. The upshot is that if the noise does not change the distributions too much, then we can legitimately restart the coupling as above and prove that the distance between the processes, now defined by interleaving the probability flow ODE and its discretization with periodic injections of noise, increases only linearly in time.
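
For intuition, the simplest instance of short-time regularization is Gaussian convolution. Writing $W_2$ for the 2-Wasserstein distance, $\mathrm{KL}$ for the KL divergence, and $\mathcal{N}(m, \Sigma)$ for the Gaussian with mean $m$ and covariance $\Sigma$ (notation chosen for this illustration), for any probability measures $\mu, \nu$ on $\mathbb{R}^d$ and any $\sigma > 0$,
\[
    \mathrm{KL}\bigl(\mu * \mathcal{N}(0, \sigma^2 I_d) \bigm\| \nu * \mathcal{N}(0, \sigma^2 I_d)\bigr) \le \frac{W_2^2(\mu, \nu)}{2\sigma^2}\,.
\]
This follows from the joint convexity of the KL divergence applied to an optimal coupling of $\mu$ and $\nu$, together with the identity $\mathrm{KL}(\mathcal{N}(x, \sigma^2 I_d) \,\|\, \mathcal{N}(y, \sigma^2 I_d)) = \norm{x - y}^2 / (2\sigma^2)$. This Gaussian case is only meant to illustrate the mechanism.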

It turns out that na\"{\i}ve injection of noise, e.g., convolution with a Gaussian of small variance, is somewhat wasteful as it fails to preserve the true process and leads to poor polynomial dependence on the dimension. On the other hand, if we instead run the overdamped Langevin diffusion with potential chosen so that the law of the true process is stationary, then we can recover the linear dependence on $d$ in Theorem~\ref{thm:pc_over}. Then by replacing overdamped Langevin diffusion with its underdamped counterpart, which has the advantage of much smoother trajectories, we can obtain the desired quadratic speedup in dimension dependence in Theorem~\ref{thm:pc_under}.
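
As a reminder, the two diffusions can be written in their standard forms as follows; the parametrization and the friction parameter $\beta > 0$ are assumptions made for this sketch and need not match the normalization used in our formal statements. For a target density $q$ (here, the law of the true process at a fixed time) and a standard Brownian motion $(B_s)_{s \ge 0}$, the overdamped Langevin diffusion is
\[
    \mathrm{d} z_s = \nabla \ln q(z_s)\, \mathrm{d} s + \sqrt{2}\, \mathrm{d} B_s\,,
\]
which, under standard regularity conditions, has $q$ as a stationary distribution. The underdamped Langevin diffusion augments the state with a velocity variable:
\begin{align*}
    \mathrm{d} z_s &= v_s\, \mathrm{d} s\,, \\
    \mathrm{d} v_s &= \nabla \ln q(z_s)\, \mathrm{d} s - \beta v_s\, \mathrm{d} s + \sqrt{2\beta}\, \mathrm{d} B_s\,,
\end{align*}
with stationary distribution $q \otimes \gamma^d$. Since the noise enters only through the velocity, the position $z_s$ is continuously differentiable in $s$, which is the sense in which its trajectories are smoother.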

\paragraph{Score perturbation lemma.} In addition to the switch from SDE to ODE and the use of the underdamped corrector, a third ingredient is essential to our improved dimension dependence. The first two ensure that the trajectory of our algorithm is smoother than that of DDPMs, so that even over time windows that scale with $1/\sqrt{d}$, the process does not change too much. By extension, as the score functions are Lipschitz, this means that any fixed score function evaluated over iterates in such a window does not change much. This amounts to controlling discretization error in \emph{space}.

It is also necessary to control discretization error in \emph{time}, i.e., to prove what some prior works refer to as a \emph{score perturbation lemma}~\cite{leelutan22sgmpoly}. That is, for any fixed \emph{iterate} $x$, we want to show that the score function $\nabla \ln q_t(x)$ does not change too much as $t$ varies over a small window. Unfortunately, prior works were only able to establish this over windows of length $1/d$. In this work, we improve this to windows of length $1/\sqrt{d}$ (see Lemma~\ref{l:sp-ou} and Corollary~\ref{c:sp}).

In our proof, we bound the squared $L^2$ norm of the derivative of the score along the trajectory of the ODE. The score function evaluated at $y$ can be expressed as $\E_{P_{0|t}(\cdot \mid y)}[\nb U]$; here, the posterior distribution $P_{0|t}(\cdot \mid y)$ is essentially the prior $\qdata$ tilted by a Gaussian of variance $O(t)$. Hence we need to bound the change in the expectation when we change the distribution from $P_{0|t}$ to $P_{0|t+\De t}$; because $\nb U$ is $L$-Lipschitz, we can bound this change by the Wasserstein distance between the two distributions. For small enough $t$, $P_{0|t}$ is strongly log-concave, so a transport cost inequality bounds this Wasserstein distance in terms of the KL divergence, which is more easily controlled. Indeed, we can bound it by the KL divergence between the joint distributions $P_{0,t}$ and $P_{0,t+\De t}$, which reduces to bounding the KL divergence between Gaussians of unequal variance.
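
The chain of inequalities above uses three standard facts, which we record here as a sketch; the symbols $\mu$, $\nu$, $\pi$, $\alpha$, $\sigma_1$, $\sigma_2$, and $\varepsilon$ are placeholders for this illustration, and $W_1$, $W_2$, $\mathrm{KL}$, $\mathcal{N}$ denote the Wasserstein distances, the KL divergence, and the Gaussian law. First, since $\nb U$ is $L$-Lipschitz,
\[
    \norm{\E_\mu[\nb U] - \E_\nu[\nb U]} \le L\, W_1(\mu, \nu) \le L\, W_2(\mu, \nu)\,.
\]
Second, if $\pi$ is $\alpha$-strongly log-concave, then Talagrand's transport cost inequality states that for every probability measure $\mu$,
\[
    W_2^2(\mu, \pi) \le \frac{2}{\alpha}\, \mathrm{KL}(\mu \,\|\, \pi)\,.
\]
Third, for isotropic centered Gaussians in $\mathbb{R}^d$,
\[
    \mathrm{KL}\bigl(\mathcal{N}(0, \sigma_1^2 I_d) \bigm\| \mathcal{N}(0, \sigma_2^2 I_d)\bigr) = \frac{d}{2}\, \Bigl(\frac{\sigma_1^2}{\sigma_2^2} - 1 - \ln \frac{\sigma_1^2}{\sigma_2^2}\Bigr)\,.
\]
In particular, if $\sigma_1^2/\sigma_2^2 = 1 + \varepsilon$ with $\varepsilon \ge 0$, then $\ln(1+\varepsilon) \ge \varepsilon - \varepsilon^2/2$ shows that the right-hand side is at most $d\varepsilon^2/4$. The dimension thus enters through $d\varepsilon^2$, which remains bounded when the relative change in variance is of order $1/\sqrt{d}$; this is heuristically consistent with the window length above, and the precise argument is given in Lemma~\ref{l:sp-ou}.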

\edit{However, since our score perturbation lemma degrades near the beginning of the forward process, we require better control of the discretization error during this part of the algorithm, which leads to our choice of geometrically decreasing step sizes. Alternatively, we could use a two-stage step size schedule; see Remark~\ref{rmk:two_stage}.}