Title: PHASE STOCHASTIC BRIDGES: ACCELERATED GENERATIVE MODELING VIA OPTIMAL CONTROL IN PHASE SPACE

Abstract: Generative modeling has seen significant advances, yet efficient sampling, especially with limited computational budgets, remains a critical challenge. This paper introduces Phase Stochastic Bridges (PSB), a novel generative modeling framework that addresses this by operating in phase space, drawing inspiration from Critically Damped Langevin Dynamics (CLD) and Bridge Matching (BM). Leveraging Stochastic Optimal Control (SOC) theory, PSB constructs a more favorable, straighter path measure in phase space, which is highly advantageous for efficient data generation. A distinctive feature of PSB is its early-stage data prediction capability within the context of propagating generative Ordinary Differential Equations (ODEs) or Stochastic Differential Equations (SDEs). This early prediction, enabled by the model's unique structural characteristics, facilitates more efficient data generation by effectively leveraging additional velocity information along the trajectory. Our approach demonstrates comparable results in high-fidelity image generation and notably outperforms baseline methods, particularly when faced with a limited Number of Function Evaluations (NFEs). Furthermore, PSB rivals the performance of diffusion models equipped with efficient sampling techniques, underscoring its significant potential in the realm of accelerated generative modeling.

Section: INTRODUCTION
Generative modeling, particularly with Diffusion Models (DMs; Song et al. (2020a); Ho et al. (2020)), has achieved remarkable success in synthesizing high-fidelity data. DMs operate by formulating a Stochastic Differential Equation (SDE) to gradually diffuse data towards a tractable prior, and then reversing this process using a neural network to approximate the score function for data generation (Anderson, 1982;Haussmann & Pardoux, 1986). While powerful, DMs primarily operate in position space. Critically-damped Langevin Dynamics (CLD; Dockhorn et al. (2021)) extends this framework into phase space by introducing an auxiliary velocity variable, which is defined by tractable Gaussian distributions at the initial and terminal time steps. This augmentation leads to smoother trajectories and enhanced empirical performance and sample efficiency. However, despite these advancements, CLD still suffers from persistent sampling inefficiency due to unnecessary curvature in its dynamics (Fig. 1), as it must converge to equilibrium for sampling from the tractable prior.

The success of DMs has also spurred advancements in alternative generative modeling paradigms, such as Bridge Matching (BM; (Peluchetti, 2021;Liu et al., 2022;2023)) and Flow Matching (FM; Lipman et al. (2022)). These models utilize dynamic transport maps underpinned by SDEs or Ordinary Differential Equations (ODEs) to construct direct bridges between two arbitrary distributions, relaxing the reliance on an asymptotic forward diffusion process. This versatility allows them to draw insights from optimal transport (Pooladian et al., 2023), normalizing flows (Tong et al., 2023b), and optimal control (Liu et al., 2023).

In this paper, we aim to significantly enhance the sample efficiency of velocity-based generative modeling, like CLD, by leveraging Stochastic Optimal Control (SOC) theory. Specifically, we utilize the principles of stochastic bridges within linear momentum systems (Chen & Georgiou, 2015) to construct a more favorable path measure that directly connects the data and prior distributions. This approach yields substantially straighter position and velocity trajectories compared to CLD (Fig. 1), making the dynamics more amenable to efficient sampling. Unlike DM and FM, which rely exclusively on position information for target data estimation, our method re-establishes the property that data points can be represented as linear combinations of scaled intermediate dynamics and Gaussian noise, incorporating both state and velocity information to enhance estimation precision. This allows our model to generate high-fidelity images at remarkably early time steps (Fig. 2) and enables a novel sampling technique that achieves competitive results with a small Number of Function Evaluations (NFEs), e.g., 5 to 10. Table 1 outlines the key design differences among these generative models. In summary, our paper makes the following significant contributions:
1.  We propose Acceleration Generative Modeling (AGM), a novel framework built on SOC theory, which constructs favorable, straighter trajectories for efficient sampling within 2nd-order momentum dynamics, outperforming models like CLD.
2.  A key structural characteristic of AGM is its ability to estimate realistic data points at an early time, a concept we term "sampling-hop." This innovation not only drastically reduces sampling complexity but also offers a fresh perspective on accelerating generative model sampling by effectively leveraging additional velocity information from the dynamics.
3.  We demonstrate competitive results against state-of-the-art DM approaches equipped with specialized fast sampling techniques on image datasets, particularly excelling in low-NFE settings.

Section: PRELIMINARY
Notation: Let $x_t \in \mathbb{R}^d$ and $v_t \in \mathbb{R}^d$ denote the d-dimensional position and velocity variables of a particle $m_t = [x_t, v_t]^T \in \mathbb{R}^{2d}$ at time $t$. We denote the discretized time series as $0 \le t_0 < \cdots < t_n < \cdots < t_N < 1$. The Wiener process is denoted as $w_t$. The identity matrix is denoted as $I_d \in \mathbb{R}^{d \times d}$. We define $\Sigma_t$ as the covariance matrix of $x_t$ and $v_t$ at time step $t$.

Section: DYNAMICAL GENERATIVE MODELING
The generative modeling approaches rooted in dynamical systems, including ODE and SDE, have garnered significant attention. Here, we present three noteworthy dynamical generative models: Diffusion Model (DM), Flow Matching (FM) and Bridge Matching (BM).
Diffusion Model: In the framework of DM, given $x_0$ drawn from a data distribution $p_{\text{data}}$, the model constructs an SDE,
$$dx_t = f_t(x_t)\,dt + g(t)\,dw_t, \qquad x_0 \sim p_{\text{data}}(x), \tag{1}$$
whose terminal distribution at $t = 1$ is approximately Gaussian, i.e. $x_1 \sim \mathcal{N}(0, I_d)$. This is achieved through a careful choice of the diffusion coefficient $g_t$ and the base drift $f_t(x_t)$. The time-reversal (Anderson, 1982) of (1) results in another SDE:
$$dx_t = \left[f_t(x_t) - g_t^2 \nabla_x \log p(x_t, t)\right] dt + g(t)\,dw_t, \qquad x_1 \sim \mathcal{N}(0, I_d), \tag{2}$$
where $p(\cdot, t)$ is the marginal density of (1) at time $t$ and $\nabla_x \log p_t$ is known as the score function. SDE (2) can be regarded as the time-reversal of (1) in such a manner that the path-wise measure is almost surely equivalent to the one induced by (1). As a consequence, these two SDEs share identical marginals over time. In practice, it is feasible to analytically sample $x_t$ given $t$ and $x_0$. Additionally, we can leverage a neural network to learn the score function by regressing the scaled Stein score, $\mathbb{E}_{x_t, t}\,\|s^\theta_t(x_t, t; \theta) - \nabla_x \log p(x_t, t \mid x_0)\|_2^2$, for the purpose of propagating (2). This learned score can then be plugged into SDE (2) to simulate the generation of data that follows the target data distribution, starting from the prior distribution. Meanwhile, (2) also corresponds to an ODE which shares the same marginals over time:
$$dx_t = \left[f_t(x_t) - \tfrac{1}{2} g_t^2 \nabla_x \log p(x_t, t)\right] dt, \qquad x_1 \sim \mathcal{N}(0, I_d), \tag{3}$$
which motivates the popular samplers introduced in (Zhang & Chen, 2022; Zhang et al., 2022; Bao et al., 2022) to solve the ODE (3) efficiently.
Bridge Matching and Flow Matching: An alternative approach to exploring the time-reversal of a forward noising process involves the concept of 'building bridges' between two distinct distributions $p_0(\cdot)$ and $p_1(\cdot)$. This method entails the learning of a mimicking diffusion process, commonly referred to as bridge matching, as elucidated in previous works (Peluchetti, 2021; Shi et al., 2022).
Here we consider the SDE in the form of:
$$dx_t = v_t(x_t, t)\,dt + g_t\,dw_t \quad \text{s.t.} \quad (x_0, x_1) \sim \Pi_{0,1}(x_0, x_1) := p_0 \times p_1, \tag{4}$$
which is pinned down at an initial and a terminal point $x_0, x_1$ that are independently sampled from the predefined $p_0$ and $p_1$. This is commonly known as the reciprocal projection of $x_0$ and $x_1$ in the literature (Shi et al., 2023; Peluchetti, 2023; Liu et al., 2022; Léonard et al., 2014). The construction of such an SDE is accomplished by a careful design of $v_t$. A widely adopted choice is $v_t := (x_1 - x_t)/(1 - t)$, which induces the well-known Brownian Bridge (Liu et al., 2023; Somnath et al., 2023). Similar to the approach in DM and owing to the linear structure of the dynamics, one can efficiently estimate this drift by employing a neural network parameterized by weights $\theta$ for regression on $\mathbb{E}_{x_t, t}\,\|v^\theta_t(x_t, t; \theta) - v_t(x_t, t)\|_2^2$ given $x_1$ and $t$. As extensively discussed in previous studies (Liu et al., 2023; Shi et al., 2022), this bridge matching framework takes on the characteristics of FM (Lipman et al., 2022) when the diffusion coefficient $g_t$ tends to zero.
Remark 1. The practice of constraining a stochastic process to specific initial and terminal conditions is a well-established setup in SOC. For a gentle introduction to its connection with the Brownian Bridge and the Schrödinger Bridge, please see Appendix.C. From this perspective, one can derive the Brownian Bridge, as elaborated in Appendix.D.1. The SOC framework serves as the fundamental basis upon which we develop our algorithm.
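The pinning effect of the drift $v_t := (x_1 - x_t)/(1 - t)$ can be checked numerically. The following is a minimal one-dimensional sketch, not the implementation used in this paper: the function name `simulate_bridge`, the Euler-Maruyama discretization, the constant noise scale `sigma` (standing in for $g_t$), and the step count are all illustrative assumptions.

```python
import math
import random


def simulate_bridge(x0, x1, n_steps=200, sigma=1.0, seed=0):
    """Euler-Maruyama for dx = (x1 - x) / (1 - t) dt + sigma dw on [0, 1].

    Hypothetical helper for illustration; sigma is a constant stand-in for g_t.
    """
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    x = x0
    path = [x]
    for k in range(n_steps):
        t = k * dt
        drift = (x1 - x) / (1.0 - t)  # pulls the state toward the endpoint x1
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + drift * dt + noise
        path.append(x)
    return path


noisy = simulate_bridge(x0=-1.0, x1=2.0, sigma=1.0)
straight = simulate_bridge(x0=-1.0, x1=2.0, sigma=0.0)
print("noisy endpoint:", noisy[-1])
print("noise-free midpoint:", straight[100])
```

With `sigma=0.0` the path reduces to the straight line between the two endpoints, which is the Flow Matching limit mentioned above; with noise, the endpoint stays within one step's worth of noise of $x_1$.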

Section: ACCELERATION GENERATIVE MODEL
We apply Stochastic Optimal Control (SOC) to address the suboptimal, 'twisted' trajectories observed in momentum dynamics induced by methods like CLD (Dockhorn et al., 2021). While existing generative models such as Flow Matching, Diffusion Models, and Bridge Matching primarily estimate the target data point, x₁, using only the intermediate state of the dynamics, xₜ, our objective is to significantly expedite this estimation. We achieve this by incorporating additional dynamics-related information, specifically velocity, thereby curtailing the requisite time integration for generating high-fidelity samples.
In this section, we formally introduce our proposed method, the Acceleration Generative Model (AGM), which is deeply rooted in SOC theory. Extending the foundational work of Chen & Georgiou (2015), we generalize the framework by incorporating a time-varying diffusion coefficient and accommodating more flexible boundary conditions. This extension culminates in a novel analytical solution specifically tailored for efficient generative modeling. We rigorously demonstrate AGM's efficacy in rectifying the inherently curved trajectories of CLD, while concurrently highlighting its unique aptitude for accurately estimating the target data at significantly early timesteps (tᵢ), thereby enabling substantially more expeditious sampling.
Drawing inspiration from the Bridge Matching (BM) approach, it is crucial to formulate a trajectory that effectively bridges two distributions, p₀ and p₁. Ideally, this intermediate trajectory should possess optimal characteristics, particularly smoothness and linearity, to facilitate straightforward and efficient simulation of the dynamical system. To address this and further enhance the estimation of the target data point x₁ by explicitly incorporating velocity components, we formalize the problem within a Stochastic Optimal Control (SOC) framework. This framework is specifically formulated in phase space as follows: Definition 2 (Stochastic Bridge problem of linear momentum system (Chen & Georgiou, 2015)).
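$$\min_{a_t} \int_\tau^1 \|a_t\|_2^2\,dt + (m_1 - \bar{m}_1)^T R\,(m_1 - \bar{m}_1)$$
$$\text{s.t.} \quad \underbrace{\begin{bmatrix} dx_t \\ dv_t \end{bmatrix}}_{dm_t} = \underbrace{\begin{bmatrix} v_t \\ a_t(x_t, v_t, t) \end{bmatrix}}_{f(m, t)} dt + \underbrace{\begin{bmatrix} 0 & 0 \\ 0 & g_t \end{bmatrix}}_{\mathbf{g}_t} dw_t, \quad m_\tau := \begin{bmatrix} x_\tau \\ v_\tau \end{bmatrix} = \begin{bmatrix} \bar{x}_\tau \\ \bar{v}_\tau \end{bmatrix}, \quad R = \begin{bmatrix} r & 0 \\ 0 & r \end{bmatrix} \otimes I_d, \quad x_1 \sim p_{\text{data}}. \tag{5}$$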
In this context, the matrix $R$ is the terminal cost matrix, which measures the proximity between the propagated $m_1$ and the ground truth $\bar{m}_1$ at the terminal time $t = 1$. As the parameter $r$ approaches positive infinity, the trajectory converges toward the state $x_1$, and the problem becomes one of constrained dynamics in which the system is pinned at two predetermined boundaries, namely $m_0$ and $m_1$. This configuration aligns with the principle of constructing a feasible bridge, as advocated by BM. This interpolation approach is a natural extension (Chen & Georgiou, 2015) of the well-established Brownian Bridge (Revuz & Yor, 2013), which has been employed in trajectory inference (Somnath et al., 2023; Tong et al., 2023a) and image inpainting tasks (Liu et al., 2023), and whose connection with diffusion has been discussed in Liu et al. (2023). Note that the target velocity is not prescribed by this problem, which leaves flexibility in the design space of our approach. We opt for the linear interpolation of the intermediate point and the target point, $v_1 = (x_1 - x_t)/(1 - t)$, as the terminal velocity, which is also the optimal control in the original space (see Appendix.D.1). This choice is made because it constructs a straight trajectory. Conceptually, the acceleration $a_t$ continually guides the dynamics towards the linear interpolation of the two data points, which mitigates the impact of the introduced stochasticity. In contrast to previous bridge matching frameworks, the velocity's boundary condition in our approach varies over time, since it depends on the state $x_t$ and $t$. The velocity variable serves solely as an auxiliary component aimed at straightening the trajectories.
Regarding this SOC problem formulation, the solution is as follows. Proposition 3 (Phase Space Brownian Bridge). When $r \to +\infty$, the solution of the optimization problem (5) is
$$a^*(m_t, t) = g_t^2 P_{11} \left( \frac{x_1 - x_t}{1 - t} - v_t \right), \quad \text{where } P_{11} = \frac{-4}{g_t^2 (t - 1)}. \tag{6}$$
Proof. Please see Appendix.D.2.
Figure 2: Data estimation comparison with EDM (Karras et al., 2022). When the network is endowed with supplementary velocity information, AGM gains the capacity to estimate the target data point during the early stages of the trajectory. One can use the estimated image $\tilde{x}_1$ at $t_i < t_N$ as the generated result and allocate more NFEs to the interval $[0, t_i]$, which results in a smaller discretization error.
Remark 4. $P_{11}$ denotes the second diagonal component in the matrix $P_t$, a solution derived from the Lyapunov equation (see Lemma.9), serving as an implicit representation of the optimality of the control. This value depends on the uncontrolled dynamics, where $a_t$ is set to the zero vector in (5), and will vary accordingly when the uncontrolled dynamics change.
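A small numerical sketch of Proposition 3 for a single scalar coordinate may help make the structure of eq. (6) concrete. The function names `p11` and `optimal_acceleration` are illustrative and not part of any released code. The sketch shows that the product $g_t^2 P_{11}$ equals $4/(1 - t)$ regardless of the diffusion coefficient, and that the acceleration vanishes whenever the velocity already equals the straight-line velocity $(x_1 - x_t)/(1 - t)$.

```python
def p11(t, g):
    """P_11 = -4 / (g^2 (t - 1)) as stated in Proposition 3 (requires t < 1, g != 0)."""
    return -4.0 / (g ** 2 * (t - 1.0))


def optimal_acceleration(x, v, x1, t, g):
    """a*(m_t, t) = g^2 P_11 ((x1 - x) / (1 - t) - v), eq. (6), per coordinate."""
    target_velocity = (x1 - x) / (1.0 - t)
    return g ** 2 * p11(t, g) * (target_velocity - v)


# The feedback gain g^2 P_11 does not depend on g.
for g in (0.5, 3.0):
    print(g, g ** 2 * p11(0.25, g), 4.0 / (1.0 - 0.25))

# On the straight line toward x1 the acceleration is zero.
print(optimal_acceleration(x=0.25, v=1.0, x1=1.0, t=0.25, g=2.0))
```

Because the gain grows like $1/(1 - t)$, any deviation of $v_t$ from the straight-line velocity is corrected more aggressively as $t$ approaches 1.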

Section: TRAINING
By substituting the optimal control (6) back into the dynamics (5), we obtain the governing SDE.
As suggested by (Song et al., 2020b;Dockhorn et al., 2021), this SDE possesses a corresponding probabilistic ODE that shares the same marginal over time, where the drift term includes an additional score term ∇ᵥ log p(mₜ, t). We summarize the force terms for both the SDE and ODE formulations as follows:
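$$\begin{bmatrix} dx_t \\ dv_t \end{bmatrix} = \begin{bmatrix} v_t \\ F_t \end{bmatrix} dt + \begin{bmatrix} 0 & 0 \\ 0 & h_t \end{bmatrix} dw_t \quad \text{s.t.} \quad m_0 := \begin{bmatrix} x_0 \\ v_0 \end{bmatrix} \sim \mathcal{N}(\mu_0, \Sigma_0),$$
$$\text{Bridge Matching SDE:} \quad F_t := F^b_t(m_t, t) \equiv a^*_t(m_t, t), \quad h(t) := g(t),$$
$$\text{Probabilistic ODE:} \quad F_t := F^p_t(m_t, t) \equiv a^*_t(m_t, t) - \tfrac{1}{2} g_t^2 \nabla_v \log p(m, t), \quad h(t) := 0. \tag{7}$$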
Henceforth, we refer to the dynamics associated with the Bridge Matching SDE as AGM-SDE, and its corresponding ODE counterpart as AGM-ODE. Given the linearity of the system, both the intermediate state $m_t$ and the closed-form solution of the score term are analytically available. Specifically, the mean $\mu_t$ and covariance matrix $\Sigma_t$ of the intermediate marginal distribution
$$p_t(m_t \mid x_1) = \mathcal{N}(\mu_t, \Sigma_t)$$
of such a system can be analytically computed as
$$\Sigma_t = \begin{bmatrix} \Sigma^{xx}_t & \Sigma^{xv}_t \\ \Sigma^{xv}_t & \Sigma^{vv}_t \end{bmatrix} \otimes I_d, \quad \text{and} \quad \mu_t = \begin{bmatrix} \mu^x_t \\ \mu^v_t \end{bmatrix},$$
provided we define the boundary conditions $\mu_0$ and $\Sigma_0$, as outlined in Särkkä & Solin (2019). For further details, please refer to Appendix.D.3. To sample from such a multivariate Gaussian, we perform a Cholesky decomposition of the covariance matrix, and $m_t$ is reparameterized as:
$$m_t = \mu_t + L_t \epsilon = \mu_t + \begin{bmatrix} L^{xx}_t \epsilon_0 \\ L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1 \end{bmatrix}, \qquad \nabla_v \log p_t := -\ell_t \epsilon_1, \tag{8}$$
where
$$\Sigma_t = L_t L_t^T, \quad \epsilon = \begin{bmatrix} \epsilon_0 \\ \epsilon_1 \end{bmatrix} \sim \mathcal{N}(0, I_{2d}), \quad \text{and} \quad \ell_t = \sqrt{\frac{\Sigma^{xx}_t}{\Sigma^{xx}_t \Sigma^{vv}_t - (\Sigma^{xv}_t)^2}}.$$
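The reparameterization in eq. (8) can be sketched per coordinate, since $\Sigma_t$ acts identically on every dimension through the Kronecker product with $I_d$. The helper names below and the numerical values of the mean and covariance are illustrative assumptions; the closed forms for $\mu_t$ and $\Sigma_t$ are those of Appendix.D.3 and are not reproduced here. The sketch also cross-checks that $-\ell_t \epsilon_1$ agrees with the velocity component of the exact Gaussian score $-\Sigma_t^{-1}(m_t - \mu_t)$.

```python
import math
import random


def cholesky_2x2(s_xx, s_xv, s_vv):
    """Lower-triangular factor of [[s_xx, s_xv], [s_xv, s_vv]]."""
    l_xx = math.sqrt(s_xx)
    l_xv = s_xv / l_xx
    l_vv = math.sqrt(s_vv - l_xv ** 2)
    return l_xx, l_xv, l_vv


def sample_phase_state(mu_x, mu_v, s_xx, s_xv, s_vv, rng):
    """Draw (x_t, v_t) via eq. (8) and return the velocity score -ell_t * eps1."""
    l_xx, l_xv, l_vv = cholesky_2x2(s_xx, s_xv, s_vv)
    eps0, eps1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x_t = mu_x + l_xx * eps0
    v_t = mu_v + l_xv * eps0 + l_vv * eps1
    ell_t = math.sqrt(s_xx / (s_xx * s_vv - s_xv ** 2))
    return x_t, v_t, -ell_t * eps1


def exact_velocity_score(x_t, v_t, mu_x, mu_v, s_xx, s_xv, s_vv):
    """Velocity component of -Sigma^{-1} (m - mu) for the 2x2 Gaussian."""
    det = s_xx * s_vv - s_xv ** 2
    return -(-s_xv * (x_t - mu_x) + s_xx * (v_t - mu_v)) / det


rng = random.Random(0)
params = dict(mu_x=0.3, mu_v=-0.1, s_xx=1.0, s_xv=-0.2, s_vv=1.0)  # illustrative values
x_t, v_t, score_v = sample_phase_state(rng=rng, **params)
print(score_v, exact_velocity_score(x_t, v_t, **params))
```

The agreement of the two printed numbers reflects the identity $\ell_t = 1/L^{vv}_t$, which is why only $\epsilon_1$ appears in the velocity score.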
Parameterization: The force term can be linearly represented as a combination of the data point and Gaussian noise. Specifically, the optimal acceleration a*(mₜ, t) is given by:
$$a^*(m_t, t) = 4 x_1 (1 - t)^2 - g_t^2 P_{11} \left[ \left( \frac{L^{xx}_t}{1 - t} + L^{xv}_t \right) \epsilon_0 + L^{vv}_t \epsilon_1 \right]. \tag{9}$$
We parameterize the neural network's output for the force term as
$$F^\theta_t = s^\theta_t \cdot z_t.$$
Here, $z_t$ acts as a normalization factor, scaling the output of the network $s^\theta_t(m_t, t; \theta)$ to ensure that the variance of the network output is normalized to unity. For the detailed formulation of the normalizer $z_t$, please refer to Appendix.D.8. Following a similar approach to Bridge Matching (BM), the objective function for regressing the force term is formulated as:
$$\min_\theta \; \mathbb{E}_{t \in [0, 1]} \, \mathbb{E}_{x_1 \sim p_{\text{data}}} \, \mathbb{E}_{m_t \sim p_t(m_t \mid x_1)} \; \lambda(t) \, \|F^\theta_t(m_t, t; \theta) - F_t(m_t, t)\|_2^2 \tag{10}$$
where λ(t) is a reweighting function for the objective across the time horizon. We defer the derivation of ℓₜ and the full presentation of Lₜ, λ(t), and aₜ to Appendix.D.
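To illustrate how the regression targets in eq. (10) are assembled, here is a per-coordinate sketch. It evaluates the SDE target through eq. (6) and the ODE target through eq. (7), using the score from eq. (8). The function names, the toy "oracle" model, and all numerical values are hypothetical; the schedule $g(t) = 3(1 - t)$ is the one quoted in the experimental section, and $\lambda(t)$ is passed in as a plain weight because its form is deferred to the appendix.

```python
def diffusion(t):
    """g(t) = 3 (1 - t), the schedule quoted in the experiments (valid for t < 1)."""
    return 3.0 * (1.0 - t)


def force_targets(x1, x_t, v_t, eps1, ell_t, t):
    """Return (F^b_t, F^p_t) of eq. (7) for one scalar coordinate."""
    g = diffusion(t)
    p11 = -4.0 / (g ** 2 * (t - 1.0))
    a_star = g ** 2 * p11 * ((x1 - x_t) / (1.0 - t) - v_t)  # eq. (6)
    score_v = -ell_t * eps1                                  # eq. (8)
    return a_star, a_star - 0.5 * g ** 2 * score_v


def sample_loss(model, x1, x_t, v_t, eps1, ell_t, t, use_ode=False, weight=1.0):
    """Weighted squared error for one sample, i.e. one term of eq. (10)."""
    f_sde, f_ode = force_targets(x1, x_t, v_t, eps1, ell_t, t)
    target = f_ode if use_ode else f_sde
    return weight * (model(x_t, v_t, t) - target) ** 2


def oracle(x_t, v_t, t):
    """Toy stand-in for the network that happens to know x1 = 0.8."""
    return 4.0 / (1.0 - t) * ((0.8 - x_t) / (1.0 - t) - v_t)


print(sample_loss(oracle, 0.8, 0.1, 0.3, -0.7, 1.2, 0.4))
print(sample_loss(oracle, 0.8, 0.1, 0.3, -0.7, 1.2, 0.4, use_ode=True))
```

The oracle attains zero loss on the SDE target but not on the ODE target, because the latter additionally carries the score term $-\tfrac{1}{2} g_t^2 \nabla_v \log p$.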

Section: SAMPLING FROM AGM
Once the parameterized force term $F^\theta_t$ is trained, we can simulate the dynamics to generate samples by plugging it back into the dynamics (7). One can use any type of SDE or ODE sampler to propagate the learnt system. Here we list our choice of sampler for AGM-SDE and AGM-ODE.
Stochastic Sampler: To simulate the SDE, prior works mainly rely on Euler-Maruyama (EM) (Kloeden et al., 1992) and related methods. We adopt the Symmetric Splitting Sampler (SSS) from Dockhorn et al. (2021) in our AGM-SDE, because of its compelling performance on momentum systems.
Deterministic Sampler: This system is inherently underactuated because the force term is injected exclusively into the velocity component, while velocity drives the position, the variable of primary interest in the generative modeling context. More specifically, at time step $t_i$, the impact of the force does not immediately manifest in the position but only takes effect at the subsequent time step $t_{i+1}$ after discretizing the time horizon. At time $t_0$, it is undesirable to propagate the state $x_0$ using an initially uncontrolled velocity over an extended time interval $\delta_0$. This delay also matters whenever the time interval $\delta_t$ is large, which impedes our ability to reduce the NFE during sampling. We therefore adopt an Exponential Integrator (EI) approach, as elaborated in Zhang & Chen (2022). Empirical evidence suggests that this method aligns well with our model. We provide an illustrative example of how AGM-ODE, in conjunction with the EI technique, injects the learnt network into both the velocity and position channels simultaneously:
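$$\begin{bmatrix} x_{t_{i+1}} \\ v_{t_{i+1}} \end{bmatrix} = \Phi(t_{i+1}, t_i) \begin{bmatrix} x_{t_i} \\ v_{t_i} \end{bmatrix} + \sum_{j=0}^{w} \begin{bmatrix} \int_{t_i}^{t_{i+1}} (t_{i+1} - \tau)\, z_\tau\, M_{i,j}(\tau)\, d\tau \cdot s^\theta_t(m_{t_{i-j}}, t_{i-j}) \\ \int_{t_i}^{t_{i+1}} z_\tau\, M_{i,j}(\tau)\, d\tau \cdot s^\theta_t(m_{t_{i-j}}, t_{i-j}) \end{bmatrix},$$
$$\text{where} \quad M_{i,j}(\tau) = \prod_{k \neq j} \frac{\tau - t_{i-k}}{t_{i-j} - t_{i-k}}, \quad \text{and} \quad \Phi(t, s) = \begin{bmatrix} 1 & t - s \\ 0 & 1 \end{bmatrix}. \tag{11}$$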
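As a rough illustration of eq. (11), the sketch below implements the lowest-order case $w = 0$ (where $M_{i,0} \equiv 1$) under the simplifying assumption $z_\tau \equiv 1$, so that the two integrals reduce to $h^2/2$ and $h$ with $h = t_{i+1} - t_i$. It also includes the Lagrange-type coefficient $M_{i,j}$ for general $w$. All function names are illustrative, and the constant normalizer is an assumption made only to keep the integrals in closed form.

```python
def transition(t, s):
    """Phi(t, s) = [[1, t - s], [0, 1]] for the uncontrolled dynamics."""
    return ((1.0, t - s), (0.0, 1.0))


def lagrange_coefficient(tau, nodes, j):
    """M_{i,j}(tau), with nodes[k] playing the role of t_{i-k}."""
    out = 1.0
    for k, t_k in enumerate(nodes):
        if k != j:
            out *= (tau - t_k) / (nodes[j] - t_k)
    return out


def ei_step_order0(x, v, force, t_i, t_next):
    """One EI step with w = 0 and z = 1: the force enters position and velocity."""
    h = t_next - t_i
    phi = transition(t_next, t_i)
    x_free = phi[0][0] * x + phi[0][1] * v
    v_free = phi[1][0] * x + phi[1][1] * v
    return x_free + 0.5 * h * h * force, v_free + h * force


def euler_step(x, v, force, t_i, t_next):
    """Plain Euler step: the force reaches the position only one step later."""
    h = t_next - t_i
    return x + h * v, v + h * force


print(ei_step_order0(0.3, -0.5, 2.0, 0.0, 0.4))
print(euler_step(0.3, -0.5, 2.0, 0.0, 0.4))
```

The two printed positions differ by the term $\tfrac{1}{2} h^2 F$, which is exactly the contribution of the force to the position channel that the Euler step omits.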
In Eq.11, $\Phi(t, s)$ denotes the transition matrix of our system, while $M_{i,j}(\tau)$ represents the $w$-order multistep coefficient (Hochbruck & Ostermann, 2010). For a comprehensive derivation of these terms, please refer to Appendix.D.9. It is worth noting that the mapping of $s^\theta$ into both the position and velocity channels significantly mitigates the errors introduced by discretization delays.
Sampling-hop: In the context of CLD (Dockhorn et al., 2021), the focus is on estimating the score function w.r.t. velocity, which essentially corresponds to estimating the scaled $\epsilon_1$ in our notation. However, this information alone is not sufficient for estimating the data point $x_1$: knowledge of $\epsilon_0$ is also required. In our case, the training objective implicitly includes both $\epsilon_0$ and $\epsilon_1$ (see eq.9), hence one can recover $x_1$ by Proposition.5. We observe that when the network is equipped with additional velocity information, it acquires the capability to estimate the target data point during the early stages of the trajectory, as illustrated in fig. 2. This estimation can be seamlessly integrated into AGM-SDE and AGM-ODE, and we name it sampling-hop. Specifically,
Proposition 5 (Sampling-Hop). Given the state, velocity and trained force term $F^\theta_t$ at time step $t$ in the sampling phase, the estimated data point $\tilde{x}_1$ can be represented as
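$$\tilde{x}^{\text{SDE}}_1 = (1 - t) \left( \frac{F^\theta_t}{g_t^2 P_{11}} + v_t \right) + x_t, \quad \text{or} \quad \tilde{x}^{\text{ODE}}_1 = \frac{F^\theta_t + g_t^2 P_{11} (\alpha_t x_t + \beta_t v_t)}{4 (t - 1)^2 + g_t^2 P_{11} (\alpha_t \mu^x_t + \beta_t \mu^v_t)} \tag{12}$$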
for AGM-SDE and AGM-ODE dynamics respectively, where $\beta_t = L^{vv}_t + \frac{1}{2 P_{11}}$ and $\alpha_t = \left( \frac{L^{xx}_t}{1 - t} + L^{xv}_t \right) - \ldots$

Algorithm 1 Training
...
5: Sample $m_t = \mu_t + L_t \epsilon$ (eq.8).
6: Compute target $F_t$ (eq.7) using optimal acceleration (eq.9).
7: Compute loss $\mathbb{E}\, \lambda \|F^\theta_t - F_t\|_2^2$ (eq.10).
8: Take a gradient descent step with respect to $F^\theta_t(m_t, t; \theta)$.
9: end while

Algorithm 2 Sampling
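The SDE form of sampling-hop amounts to solving the optimal acceleration of Proposition 3, eq. (6), for $x_1$. The per-coordinate sketch below checks this round trip and shows how an error in the predicted force propagates into the data estimate. The function names are illustrative, and the identity $g_t^2 P_{11} = 4/(1 - t)$ from Proposition 3 is used to eliminate the diffusion coefficient.

```python
def gain(t):
    """g_t^2 P_11 = 4 / (1 - t); the diffusion coefficient cancels (t < 1)."""
    return 4.0 / (1.0 - t)


def optimal_acceleration(x_t, v_t, x1, t):
    """Eq. (6): a* = g^2 P_11 ((x1 - x_t) / (1 - t) - v_t)."""
    return gain(t) * ((x1 - x_t) / (1.0 - t) - v_t)


def sampling_hop_sde(x_t, v_t, force, t):
    """Estimate x1 from the current state, velocity and force by inverting eq. (6)."""
    return (1.0 - t) * (force / gain(t) + v_t) + x_t


x_t, v_t, x1, t = 0.2, -1.3, 0.9, 0.1
force = optimal_acceleration(x_t, v_t, x1, t)
print("exact force   ->", sampling_hop_sde(x_t, v_t, force, t))
print("force + 0.5   ->", sampling_hop_sde(x_t, v_t, force + 0.5, t))
```

With the exact force the estimate returns $x_1$; a force error $\delta$ shifts the estimate by $(1 - t)^2 \delta / 4$, so the same network error matters less as $t$ grows.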

Section: EXPERIMENTAL RESULTS
Figure 3: The standard deviation $\sigma$ of the terminal marginal for uncontrolled dynamics. We empirically selected the hyperparameter $k = -0.2$. This choice induces a terminal marginal distribution whose $\sigma$ covers the data range under uncontrolled dynamics.
Architectures and Hyperparameters: We parameterize $s^\theta_t(\cdot, \cdot; \theta)$ using the modified NCSN++ model provided in Karras et al. (2022). We employ six input channels, accounting for both position and velocity variables, as opposed to the standard three channels used for CIFAR-10 (Krizhevsky et al., 2009), AFHQv2 (Choi et al., 2020) and ImageNet (Deng et al., 2009), which leads to a negligible increase in network parameters. For the purpose of comparison with CLD on the toy dataset, we adopt the same ResNet-based architecture utilized in CLD. Throughout all of our experiments, we maintain a monotonically decreasing diffusion coefficient, given by $g(t) = 3(1 - t)$. For the detailed experimental setup, please refer to Appendix.E.
Evaluation: To assess the performance and the sampling speed of various algorithms, we employ the Fréchet Inception Distance score (FID; Heusel et al. (2017)) and the Number of Function Evaluations (NFE) as our metrics. For FID evaluation, we utilize reference statistics of all datasets obtained from EDM (Karras et al., 2022) and use 50k generated samples to evaluate. Additionally, we reevaluate the FID of CLD and EDM using the same reference statistics to ensure consistency in our comparisons. For all other reported values, we directly source them from respective referenced papers.
Selection of $\Sigma_0$: The choice of initial covariance $\Sigma_0$ directly influences the path measure of the trajectory. In our case, we set $\Sigma_0 := \begin{bmatrix} 1 & k \\ k & 1 \end{bmatrix}$ with hyperparameter $k$. We observe that trajectories tend to exhibit pronounced curvature when $k$ is positive and the absolute value of the position is large. This behavior is particularly noticeable when dealing with images, where the data scale ranges from -1 to 1. We aim for favorable uncontrolled dynamics, as this can potentially lead to better controlled dynamics. Our strategy is to design $k$ in such a way that the marginal distribution of the uncontrolled dynamics at $t_N = 1$ effectively covers the range of image data values while $k$ remains negative. We can express the marginal of the uncontrolled dynamics by leveraging the transition matrix $\Phi(1, 0)$, which gives $x_1 := x_0 + v_0$. Figure 3 illustrates the standard deviation of $x_1$ for various values of $k$. Based on our empirical observations, we choose $k = -0.2$ for all experiments, as it effectively covers the data range. The subsequent controlled dynamics (eq.7) are constructed on top of these uncontrolled dynamics.
We underscore the effectiveness of sampling-hop, especially when faced with a constrained NFE budget, in comparison to baselines. We validate it on the CIFAR-10 and AFHQv2 datasets respectively. Fig. 4 illustrates that AGM-ODE is able to generate plausible images even when NFE = 5 and outperforms EDM (Karras et al., 2022) both visually and numerically on the AFHQv2 dataset when the NFE is extremely small (NFE < 15). We also compare with other fast sampling algorithms built upon DM in Table 5 on the CIFAR-10 dataset, where AGM-ODE demonstrates competitive performance. Notably, AGM-ODE outperforms the baseline CLD with the same EI sampler by a large margin. We suspect that the improvement stems from the rectified trajectory, which is friendlier to the ODE solver.
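For the selection of $\Sigma_0$ described above, the terminal spread of the uncontrolled dynamics follows from $x_1 = x_0 + v_0$, so that $\mathrm{Var}(x_1) = \Sigma^{xx}_0 + 2 \Sigma^{xv}_0 + \Sigma^{vv}_0 = 2 + 2k$ per coordinate. The short sketch below evaluates this for a few values of $k$; the function name and the grid of $k$ values are illustrative, and the numbers are a direct consequence of the stated $\Sigma_0$ rather than a reproduction of Figure 3.

```python
import math


def terminal_std(k):
    """Std of x_1 = x_0 + v_0 when Cov([x_0, v_0]) = [[1, k], [k, 1]]."""
    phi_row = (1.0, 1.0)  # first row of Phi(1, 0) = [[1, 1], [0, 1]]
    sigma0 = ((1.0, k), (k, 1.0))
    var = sum(phi_row[a] * sigma0[a][b] * phi_row[b]
              for a in range(2) for b in range(2))
    return math.sqrt(var)


for k in (-0.8, -0.5, -0.2, 0.0, 0.5):
    print(f"k = {k:+.1f}  std(x_1) = {terminal_std(k):.3f}")
```

For $k = -0.2$ the standard deviation is $\sqrt{1.6}$, which is above 1 and therefore spans the image data range of -1 to 1 while keeping $k$ negative.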
Conditional Generation: We showcase the capability of AGM to generate conditional samples using an unconditional model (fig. 5) by incorporating conditional information into the prior velocity variable $v_0$. Instead of employing a randomly sampled $v_0$, we use a linear combination of $v_0$ and the desired velocity $v_1 = (x_1 - x_{t_0})/(1 - t_0)$, where $x_1$ is the conditioning data. Thus, at $t_0$, the initial velocity is defined as $v^{\text{cond}}_0 := (1 - \xi) v_0 + \xi v_1$, with $\xi$ serving as a mixing coefficient. Fig. 5 shows that AGM can generate conditional data without augmentation or additional fine-tuning. This property can be extended to the inpainting task as well; the details can be found in Appendix.F.

Section: CONCLUSION AND LIMITATION
In this paper, we introduce a novel Acceleration Generative Modeling (AGM) framework rooted in SOC theory. Within this framework, we devise more favorable, straight trajectories for the momentum system. Leveraging the intrinsic characteristics of the momentum system, we capitalize on additional velocity to expedite the sampling process by using the sampling-hop technique, significantly reducing the time required to converge to accurate predictions of realistic data points. Our experimental results, conducted on both toy and image datasets in unconditional generative tasks, demonstrate promising outcomes for fast sampling.
However, it is essential to acknowledge that our approach's performance lags behind state-of-the-art methods in scenarios with sufficient NFE. This observation suggests avenues for enhancing AGM performance. Such improvements could be achieved by enhancing the training quality through the adoption of techniques proposed in Karras et al. (2022) including data augmentation, fine-tuned noise scheduling, and network preconditioning, among others.

Section: C.2 VALUE FUNCTION, HAMILTON-JACOBI-BELLMAN (HJB) EQUATION AND RICCATI EQUATION
We adopt the classical SOC notation for the value function. Specifically, a subscript on the value function $V$ denotes a partial derivative. For example, $V_t$, $V_x$ and $V_{xx}$ denote the first-order derivatives of $V$ w.r.t. time $t$ and state $x$, and the second-order derivative of $V$ w.r.t. $x$. We first define the value function as:
$$V(x_t, t) = \inf_u \mathbb{E}\left[ \int_t^1 \tfrac{1}{2} \|u_\tau\|_2^2\, d\tau + x_1^T R x_1 \right]$$
and the dynamics are
$$dx_t = (A x_t + g_t u_t)\,dt + g_t\,dw_t.$$
Applying Bellman's principle to the value function, one gets:
$$\begin{aligned}
V(t, x_t) &= \inf_u \mathbb{E}\left[ V(t + dt, x_{t+dt}) + \int_t^{t+dt} \tfrac{1}{2} \|u_\tau\|_2^2\, d\tau \right] \\
&= \inf_u \mathbb{E}\left[ \tfrac{1}{2} \|u_t\|_2^2\,dt + V(t, x_t) + V_t(t, x_t)\,dt + V_x(t, x)^T dx + \tfrac{1}{2} \mathrm{tr}\left(V_{xx}\, g g^T\right) dt \right] \\
&\quad \text{(plug in the dynamics } dx_t = (A x_t + g_t u_t)\,dt + g_t\,dw_t\text{)} \\
&= \inf_u \mathbb{E}\left[ \tfrac{1}{2} \|u_t\|_2^2\,dt + V(t, x_t) + V_t(t, x_t)\,dt + V_x(t, x)^T \left( (A x_t + g_t u_t)\,dt + g\,dw_t \right) + \tfrac{1}{2} \mathrm{tr}\left(V_{xx}\, g g^T\right) dt \right] \\
&= \inf_u \left[ \tfrac{1}{2} \|u_t\|_2^2\,dt + V(t, x_t) + V_t(t, x_t)\,dt + V_x(t, x)^T (A x_t + g_t u_t)\,dt + \tfrac{1}{2} \mathrm{tr}\left(V_{xx}\, g g^T\right) dt \right]
\end{aligned}$$
One obtains:
$$V_t + \inf_u \left[ \tfrac{1}{2} \|u_t\|_2^2 + V_x^T (A x_t + g_t u_t) + \tfrac{1}{2} \mathrm{tr}\left(V_{xx}\, g g^T\right) \right] = 0$$
The optimal control can be obtained as
$$u^*_t = -g_t V_x$$
Plugging it back, one obtains the HJB PDE:
$$V_t - \tfrac{1}{2} V_x^T g g^T V_x + V_x^T A x_t + \tfrac{1}{2} \mathrm{tr}\left(V_{xx}\, g g^T\right) = 0$$
We assume that there exists a matrix $Q$ such that $V(x, t) \equiv \tfrac{1}{2} x^T Q x + \Xi(t)$. By matching the terms of different powers in the HJB equation, one can write:
$$-\dot{\Xi} - \tfrac{1}{2} x^T \dot{Q} x = -\tfrac{1}{2} x^T Q g g^T Q x + x^T A^T Q x + \tfrac{1}{2} \mathrm{tr}\left(Q g g^T\right) \tag{14}$$
with boundary condition:
$$\Xi(1) = 0, \quad Q(1) = R \tag{15}$$
Because $x^T A^T Q x = x^T Q A x$, one arrives at the Riccati equation:
$$-\dot{Q} = A^T Q + Q A - Q g g^T Q \tag{16}$$
Recalling that the optimal solution is $u^*_t = -g_t V_x$ and $V := \tfrac{1}{2} x^T Q x + \Xi(t)$, the optimal control can be expressed through the solution of the Riccati equation:
$$u^*_t = -g^T Q(t)\, x_t.$$

Section: C.3 RICCATI EQUATION AND LYAPUNOV EQUATION
Here we provide the connection between the Riccati equation and the Lyapunov equation in the current setup. Lemma 6. Define $P(t) := Q(t)^{-1}$, in which $Q(t)$ is the solution of the Riccati equation (eq.16). Then $P(t)$ solves the Lyapunov equation:
$$\dot{P} = A P + P A^T - g g^T \tag{17}$$
For notation consistency, we name the elements in the $P$ matrix as $P = \begin{bmatrix} P_{00} & P_{01} \\ P_{10} & P_{11} \end{bmatrix}$.
Proof. By plugging $P(t) := Q(t)^{-1}$ into the Lyapunov equation, one gets:
$$\frac{d}{dt} Q^{-1} = A Q^{-1} + Q^{-1} A^T - g g^T \;\Leftrightarrow\; -Q^{-1} \dot{Q}\, Q^{-1} = A Q^{-1} + Q^{-1} A^T - g g^T \;\Leftrightarrow\; -\dot{Q} = Q A + A^T Q - Q g g^T Q$$
By Lemma.6, the optimal control can also be represented through the solution of the Lyapunov equation:
$$u^*_t = -g^T P(t)^{-1} x_t,$$
which is indeed the optimal control term used in Chen & Georgiou (2015) after adopting their notation, and it is the same as the optimal control term we use in Lemma.12 without base dynamics compensation.
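Lemma 6 can be sanity-checked numerically in a scalar analogue, where $A = a$ and $g$ are real numbers, the Lyapunov equation reads $\dot{P} = 2aP - g^2$, and the Riccati equation reads $-\dot{Q} = 2aQ - g^2 Q^2$. The sketch below uses the closed-form scalar Lyapunov solution and central finite differences; the scalar setting, the parameter values, and the function names are assumptions chosen for illustration (the momentum system of this paper has a nilpotent $2 \times 2$ matrix $A$ instead).

```python
import math


def lyapunov_p(t, a, g, p_terminal):
    """Closed-form solution of dP/dt = 2 a P - g^2 with P(1) = p_terminal (a != 0)."""
    steady = g ** 2 / (2.0 * a)
    return steady + (p_terminal - steady) * math.exp(2.0 * a * (t - 1.0))


def riccati_residual(t, a, g, p_terminal, h=1e-5):
    """Residual of -dQ/dt = 2 a Q - g^2 Q^2 for Q = 1 / P, via central differences."""
    def q(s):
        return 1.0 / lyapunov_p(s, a, g, p_terminal)

    q_dot = (q(t + h) - q(t - h)) / (2.0 * h)
    return -q_dot - (2.0 * a * q(t) - g ** 2 * q(t) ** 2)


for t in (0.1, 0.5, 0.9):
    print(t, riccati_residual(t, a=0.7, g=1.3, p_terminal=0.5))
```

The residuals are zero up to finite-difference error, consistent with $Q = P^{-1}$ solving the Riccati equation whenever $P$ solves the Lyapunov equation.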

Section: C.4 SOC CONNECTION WITH SCHRÖDINGER BRIDGE
The optimal control solution is also the solution of the Schrödinger Bridge when the terminal condition degenerates to a point mass (see the example of the Brownian Bridge in Appendix.D.1). It is also the solution of the Schrödinger Bridge when the optimal pairing is available; see proposition.

We adopt the presentation from Kappen (2008). We consider the control problem:
$$\min_{u_t} \int_t^1 \tfrac{1}{2} \|u_t\|_2^2\,dt + \tfrac{r}{2} \|x_1 - \bar{x}_1\|_2^2 \quad \text{s.t.} \quad dx_t = u_t\,dt, \quad x_0 = \bar{x}_0$$
where $r$ is the terminal cost coefficient and $x_0$ denotes the state at the starting time $t$ of the integral. According to the Pontryagin Maximum Principle (PMP; Kirk (2004)) recipe, one can construct the Hamiltonian:
$$H(t, x, u, \gamma) = -\tfrac{1}{2} \|u_t\|_2^2 + \gamma u_t$$
By setting
$$\frac{\partial H}{\partial u_t} = 0,$$
the optimized Hamiltonian is:
$$H^*(t, x, u, \gamma) = \tfrac{1}{2} \gamma^2, \quad \text{where } u_t = \gamma$$
Then we solve the Hamiltonian equations of motion:
$$\frac{dx_t}{dt} = \frac{\partial H^*}{\partial \gamma} = \gamma, \qquad \frac{d\gamma}{dt} = \frac{\partial H^*}{\partial x} = 0, \qquad \text{where } x_0 = \bar{x}_0 \text{ and } \gamma_1 = -r (x_1 - \bar{x}_1)$$
One can notice that the solution for $\gamma_t$ is the constant $\gamma_t = \gamma = -r (x_1 - \bar{x}_1)$, hence the state moves along a straight line and reaches $x_1 = x_0 + (1 - t)\gamma$ at the terminal time. Substituting this into the terminal condition gives
$$\gamma = -r (x_1 - \bar{x}_1) = -r \left( x_0 + (1 - t)\gamma - \bar{x}_1 \right) \;\Rightarrow\; u^*_t := \gamma = \frac{r (\bar{x}_1 - x_0)}{1 + r (1 - t)}$$
When $r \to +\infty$, we arrive at the optimal control $u^*_t = \frac{\bar{x}_1 - x_0}{1 - t}$. Due to certainty equivalence, this is also the optimal control law for $dx_t = u_t\,dt + dw_t$. By plugging it back into the dynamics, with $x_0$ replaced by the current state $x_t$, we obtain the well-known Brownian Bridge:
$$dx_t = \frac{\bar{x}_1 - x_t}{1 - t}\,dt + dw_t$$
Remark 7. If there is no stochasticity $dw_t$, one gets $u_t := \frac{\bar{x}_1 - x_t}{1 - t} = \bar{x}_1 - x_0$, which is the vector field constructed by Lipman et al. (2022) during training.
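The role of the terminal cost coefficient $r$ in the derivation above can be illustrated with a few lines of arithmetic. The sketch evaluates the finite-$r$ control $u^* = r(\bar{x}_1 - x)/(1 + r(1 - t))$ and the resulting terminal gap $\bar{x}_1 - x_1$ of the noise-free dynamics; the function names and test values are illustrative.

```python
def control(x_now, x_target, t, r):
    """Finite-r optimal control u* = r (x_target - x_now) / (1 + r (1 - t))."""
    return r * (x_target - x_now) / (1.0 + r * (1.0 - t))


def terminal_gap(x_now, x_target, t, r):
    """Distance to the target at time 1 when u* is held constant (no noise)."""
    x_end = x_now + (1.0 - t) * control(x_now, x_target, t, r)
    return x_target - x_end


for r in (1.0, 10.0, 1e3, 1e6):
    print(f"r = {r:g}  u* = {control(0.0, 1.0, 0.5, r):.6f}  "
          f"gap = {terminal_gap(0.0, 1.0, 0.5, r):.6f}")
```

The gap equals $(\bar{x}_1 - x)/(1 + r(1 - t))$ and vanishes as $r \to +\infty$, while the control approaches the Brownian Bridge drift $(\bar{x}_1 - x)/(1 - t)$.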

Section: D.2 PROOF OF PROPOSITION.3
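Proposition 8. The solution of the stochastic bridge problem of the linear momentum system (Chen & Georgiou, 2015) is
$$a^*(m_t, t) = g_t^2 P_{11} \left( \frac{x_1 - x_t}{1 - t} - v_t \right), \quad \text{where } P_{11} = \frac{-4}{g_t^2 (t - 1)}. \tag{18}$$
Proof. From Lemma.12, the optimal control for this problem is
$$u^*_t = -g g^T P_t^{-1} \left( m_t - \Phi(t, 1)\, m_1 \right)$$
where the state transition function $\Phi$ can be obtained from Lemma.11, $P_t$ is the solution of the Lyapunov equation, and $P_t^{-1}$, whose entries are written $P_{ij}$ below, can be found in Lemma.9. Then we have:
$$\begin{aligned}
u^*_t &= -g g^T P_t^{-1} \left( m_t - \Phi(t, 1)\, m_1 \right) \\
&= -g g^T P_t^{-1} m_t + g g^T P_t^{-1} \Phi(t, 1)\, m_1 \\
&= -\begin{bmatrix} 0 & 0 \\ 0 & g^2 \end{bmatrix} P_t^{-1} m_t + g g^T P_t^{-1} \begin{bmatrix} 1 & t - 1 \\ 0 & 1 \end{bmatrix} m_1 \\
&= -g_t^2 \begin{bmatrix} 0 & 0 \\ P_{10} & P_{11} \end{bmatrix} m_t + \begin{bmatrix} 0 & 0 \\ 0 & g_t^2 \end{bmatrix} \begin{bmatrix} P_{00} & P_{01} \\ P_{10} & P_{11} \end{bmatrix} \begin{bmatrix} 1 & t - 1 \\ 0 & 1 \end{bmatrix} m_1 \\
&= -g_t^2 \begin{bmatrix} 0 & 0 \\ P_{10} & P_{11} \end{bmatrix} m_t + g_t^2 \begin{bmatrix} 0 & 0 \\ P_{10} & P_{11} \end{bmatrix} \begin{bmatrix} 1 & t - 1 \\ 0 & 1 \end{bmatrix} m_1 \\
&= -g_t^2 \begin{bmatrix} 0 & 0 \\ P_{10} & P_{11} \end{bmatrix} m_t + g_t^2 \begin{bmatrix} 0 & 0 \\ P_{10} & P_{10} (t - 1) + P_{11} \end{bmatrix} m_1 \\
&= \begin{bmatrix} 0 \\ g_t^2 P_{10} (x_1 - x_t) + g_t^2 P_{10} (t - 1)\, v_1 + g_t^2 P_{11} (v_1 - v_t) \end{bmatrix} \\
&\quad \text{(plug in } v_1 := \tfrac{x_1 - x_t}{1 - t}\text{)} \\
&= \begin{bmatrix} 0 \\ g_t^2 P_{11} \left( \frac{x_1 - x_t}{1 - t} - v_t \right) \end{bmatrix}
\end{aligned}$$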
Lemma 9. The Lyapunov equation corresponding to the optimization problem shown in Lemma 12,
$$u_t^* \in \arg\min_{u_t \in \mathcal{U}} \mathbb{E}\left[\int_0^T \frac{1}{2}\|u_t\|^2 dt + x_1^T R\, x_1\right] \quad \text{s.t.} \quad dm_t = \underbrace{\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}}_{A} m_t\, dt + u_t\, dt + g\, dw_t, \quad m_0 = m_0,\; m_1 = m_1,$$
is given by
$$\dot{P} = AP + PA^T - g g^T . \tag{19}$$
When $g = \begin{bmatrix} 0 \\ g \end{bmatrix}$, the exact bridge corresponds to the terminal condition
$$P_1 = R^{-1} = \lim_{r \to \infty}\begin{bmatrix} r & 0 \\ 0 & r \end{bmatrix}^{-1} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}. \tag{20}$$
However, one does not need the force to converge exactly to $v_1$, because we only care about the generation quality of $x_1$. Here we give a general case in which the terminal weight on the velocity channel stays finite, so that the corresponding entry of $R^{-1}$ takes a small value $\omega$:
$$P_1 = R^{-1} = \begin{bmatrix} 0 & 0 \\ 0 & \omega \end{bmatrix}. \tag{21}$$
Then the solution is given by
$$P_t = \begin{bmatrix} \omega(t-1)^2 - \frac{1}{3}g^2(t-1)^3 & \omega(t-1) - \frac{1}{2}g^2(t-1)^2 \\ \omega(t-1) - \frac{1}{2}g^2(t-1)^2 & g^2(1-t) + \omega \end{bmatrix},$$
and the inverse of $P_t$ is
$$P_t^{-1} = \frac{1}{g^2\big(-4\omega + g^2(t-1)\big)(t-1)}\begin{bmatrix} \frac{12\,(\omega - g^2(t-1))}{(t-1)^2} & \frac{6\,(-2\omega + g^2(t-1))}{t-1} \\ \frac{6\,(-2\omega + g^2(t-1))}{t-1} & 12\omega - 4g^2(t-1) \end{bmatrix}.$$
Thus,
$$P_{10} = \frac{-12\omega + 6g^2(t-1)}{g^2\big[-4\omega + g^2(t-1)\big](t-1)^2} = \frac{-12\omega}{g^2\big[-4\omega + g^2(t-1)\big](t-1)^2} + \frac{6}{\big[-4\omega + g^2(t-1)\big](t-1)},$$
$$P_{11} = \frac{12\omega - 4g^2(t-1)}{g^2\big[-4\omega + g^2(t-1)\big](t-1)} = \frac{12\omega}{g^2\big[-4\omega + g^2(t-1)\big](t-1)} + \frac{-4}{-4\omega + g^2(t-1)} .$$
Proof. Plugging $P_t$ into the Lyapunov equation (19) verifies that $P_t$ is indeed the solution, and it satisfies the terminal condition (21) at $t = 1$.
Remark 10. Here we provide a general form in which the terminal condition of the Lyapunov equation is not the zero matrix. This means that the velocity is not required to converge exactly to the predefined $v_1$. Setting $\omega = 0$ recovers the results shown in the paper.
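The claims of Lemma 9 can also be checked numerically. The following sketch is an illustration only: it assumes a constant scalar `g`, approximates $\dot{P}$ by a central finite difference, and uses hypothetical function names.

```python
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])

def P(t, g=1.0, omega=0.0):
    """Closed-form P_t of Lemma 9 for a constant diffusion value g."""
    s = t - 1.0
    off = omega * s - 0.5 * g**2 * s**2
    return np.array([[omega * s**2 - g**2 * s**3 / 3.0, off],
                     [off, g**2 * (1.0 - t) + omega]])

def P_inv(t, g=1.0, omega=0.0):
    """Closed-form inverse of P_t as stated in Lemma 9."""
    s = t - 1.0
    c = -4.0 * omega + g**2 * s
    off = 6.0 * (-2.0 * omega + g**2 * s) / s
    M = np.array([[12.0 * (omega - g**2 * s) / s**2, off],
                  [off, 12.0 * omega - 4.0 * g**2 * s]])
    return M / (g**2 * c * s)

def lyapunov_residual(t, g=1.0, omega=0.0, h=1e-5):
    """Largest entry of |dP/dt - (A P + P A^T - g g^T)|, dP/dt by central difference."""
    G = np.array([[0.0], [g]])
    P_dot = (P(t + h, g, omega) - P(t - h, g, omega)) / (2.0 * h)
    rhs = A @ P(t, g, omega) + P(t, g, omega) @ A.T - G @ G.T
    return np.abs(P_dot - rhs).max()
```

For example, `lyapunov_residual(0.3, g=1.5, omega=0.2)` is at the level of finite-difference error, and `P(t) @ P_inv(t)` is numerically the identity for `t < 1`.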
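Lemma 11. The state transition function $\Phi(t, s)$ of the dynamics
$$dm_t = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} m_t\, dt$$
is
$$\Phi(t, s) = \begin{bmatrix} 1 & t - s \\ 0 & 1 \end{bmatrix}.$$
Proof. One can easily verify that this $\Phi$ satisfies $\partial\Phi/\partial t = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\Phi$ together with $\Phi(s, s) = I$.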
Lemma 12 (Chen & Georgiou (2015)). When $R \to \infty$, the optimal control $u_t^*$ of the following problem,
$$u_t^* = \begin{bmatrix} 0 \\ a_t \end{bmatrix} \in \arg\min_{u_t \in \mathcal{U}} \int_0^T \frac{1}{2}\|u_t\|^2 dt + x_1^T R\, x_1 \quad \text{s.t.} \quad dm_t = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} m_t\, dt + u_t\, dt + g_t\, dw_t, \quad m_0 = m_0,$$
is given by
$$u_t^* = -g g^T P_t^{-1}\big(m_t - \Phi(t,1)\, m_1\big),$$
where $P_t$ follows the Lyapunov equation (eq. 19) with boundary condition $P_1 = 0$, and $\Phi(t, s)$ is the transition matrix from time $s$ to time $t$ of the uncontrolled dynamics.
It is indeed the stochastic bridge of the following system:
$$dm_t = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} m_t\, dt + u_t\, dt + g\, dw_t, \tag{22}$$
$$m_0 = m_0, \quad m_1 = m_1. \tag{23}$$
Proof. See page 8 in Chen & Georgiou (2015).

Section: D.3 MEAN AND COVARIANCE OF SDE
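By plugging the optimal control into the system, one obtains:
$$\begin{aligned}
dm_t &= \begin{bmatrix} v_t \\ F_t \end{bmatrix} dt + g_t\, dw_t = \begin{bmatrix} v_t \\ g_t^2 P_{11}\left(\frac{x_1 - x_t}{1-t} - v_t\right) \end{bmatrix} dt + g_t\, dw_t \\
&= \underbrace{\begin{bmatrix} 0 & 1 \\ -\frac{g_t^2 P_{11}}{1-t} & -g_t^2 P_{11} \end{bmatrix}}_{\mathbf{F}_t}\begin{bmatrix} x_t \\ v_t \end{bmatrix} dt + \underbrace{\begin{bmatrix} 0 \\ \frac{g_t^2 P_{11}}{1-t}\, x_1 \end{bmatrix}}_{\mathbf{D}_t} dt + g_t\, dw_t .
\end{aligned}$$
We follow the recipe of Särkkä & Solin (2019). The mean $\mu_t$ and covariance $\Sigma_t$ of the random variable $m_t$ obey the following ordinary differential equations (ODEs):
$$d\mu_t = \mathbf{F}_t\, \mu_t\, dt + \mathbf{D}_t\, dt, \qquad d\Sigma_t = \mathbf{F}_t \Sigma_t\, dt + \big[\mathbf{F}_t \Sigma_t\big]^T dt + g g^T dt .$$
One can solve them by numerically simulating these two ODEs, whose dimension is just two. Alternatively, one can use software such as Inc. (2022) to obtain analytic solutions. With the latter approach, one gets:
$$\mu^x_t = \frac{1}{3}\, x_1 t^2 (t^2 - 4t + 6), \qquad \mu^v_t = \frac{4 t x_1}{3}(t^2 - 3t + 3),$$
$$\Sigma^{xx}_t = -\frac{1}{9}(t-1)^2\Big[-9 + 2(k-1)\, t\, \big(3 + (t-3)t\big)\big(3 + t\,[3 + (t-3)t]\big)\Big],$$
$$\Sigma^{xv}_t = \frac{1}{9}\Big\{(t-1)\Big[t\,\big(3 + (t-3)t\big)\big(9 + 8t\,(3 + (t-3)t)\big) + k\,\Big(9 - t\,\big(3 + (t-3)t\big)\big(9 + 8t\,(3 + (t-3)t)\big)\Big)\Big]\Big\},$$
$$\Sigma^{vv}_t = 1 - \frac{8}{9}(k-1)\, t\, \big[3 + (t-3)t\big]\Big\{-3 + 4t\,\big(3 + (t-3)t\big)\Big\}.$$
Here $k$ denotes the correlation coefficient of the initial covariance $\Sigma_0$ (see Appendix E.1).
Remark 13. The expressions above are rather complicated. Hence, we provide a Python function in Appendix E.1, with general initial covariance and diffusion coefficient, for easy copy-paste.
The equations above are the ones we use throughout this paper; readers are welcome to experiment with other hyperparameters.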
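The numerical route mentioned above (simulating the two moment ODEs) can be sketched as follows. This is a minimal illustration, not the authors' code: it treats a single data dimension, uses $g_t^2 P_{11} = 4/(1-t)$ from Proposition 8, a classical fourth-order Runge-Kutta scheme, a zero initial mean, and an initial covariance with unit variances and correlation `-k`. The diffusion schedule `g` is a hypothetical placeholder and all function names are assumptions.

```python
import numpy as np

def drift_matrices(t, x1):
    """F_t and D_t of the linear SDE, using g_t^2 P_11 = 4 / (1 - t)."""
    c = 4.0 / (1.0 - t)
    F = np.array([[0.0, 1.0], [-c / (1.0 - t), -c]])
    D = np.array([0.0, c / (1.0 - t) * x1])
    return F, D

def moment_rhs(t, mu, Sigma, x1, g):
    """Right-hand sides of the mean and covariance ODEs."""
    F, D = drift_matrices(t, x1)
    GGt = np.array([[0.0, 0.0], [0.0, g(t) ** 2]])
    FS = F @ Sigma
    return F @ mu + D, FS + FS.T + GGt

def integrate_moments(t_end, x1=1.0, g=lambda t: 1.0, k=0.0, n_steps=2000):
    """RK4 integration of (mu_t, Sigma_t) from t = 0 to t_end < 1."""
    mu = np.zeros(2)
    Sigma = np.array([[1.0, -k], [-k, 1.0]])
    h = t_end / n_steps
    t = 0.0
    for _ in range(n_steps):
        k1m, k1s = moment_rhs(t, mu, Sigma, x1, g)
        k2m, k2s = moment_rhs(t + h / 2, mu + h / 2 * k1m, Sigma + h / 2 * k1s, x1, g)
        k3m, k3s = moment_rhs(t + h / 2, mu + h / 2 * k2m, Sigma + h / 2 * k2s, x1, g)
        k4m, k4s = moment_rhs(t + h, mu + h * k3m, Sigma + h * k3s, x1, g)
        mu = mu + h / 6 * (k1m + 2 * k2m + 2 * k3m + k4m)
        Sigma = Sigma + h / 6 * (k1s + 2 * k2s + 2 * k3s + k4s)
        t += h
    return mu, Sigma
```

The mean returned by `integrate_moments` can be compared directly with the closed-form $\mu^x_t$ and $\mu^v_t$, which do not depend on the choice of `g`; the covariance does depend on it.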

Section: D.4 DERIVATION FROM SDE TO ODE FOR PHASE DYNAMICS
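One can represent the dynamics
$$\begin{bmatrix} dx_t \\ dv_t \end{bmatrix} = \begin{bmatrix} v_t \\ F_t \end{bmatrix} dt + \begin{bmatrix} 0 & 0 \\ 0 & g_t \end{bmatrix} dw_t \quad \text{s.t.} \quad m_0 := \begin{bmatrix} x_0 \\ v_0 \end{bmatrix} \sim \mathcal{N}(\mu_0, \Sigma_0) \tag{24}$$
in the compact form
$$dm_t = f(m_t)\, dt + g_t\, dw_t .$$
Its corresponding Fokker-Planck partial differential equation (Øksendal, 2003) reads
$$\frac{\partial p_t}{\partial t} = -\sum_{i}\frac{\partial}{\partial m_i}\big[f_i(m, t)\, p_t(m_t)\big] + \frac{1}{2}\sum_{i,j}\frac{\partial^2}{\partial m_i \partial m_j}\Big[\big(g_t g_t^T\big)_{ij}\, p_t(m_t)\Big]. \tag{25}$$
According to eq. (37) in Song et al. (2020b), one can rewrite this PDE as
$$\frac{\partial p_t}{\partial t} = -\sum_{i}\frac{\partial}{\partial m_i}\Big\{f_i(m_t, t)\, p_t(m_t) - \frac{1}{2}\Big[p(m_t)\,\nabla_m \cdot (g_t g_t^T) + p(m_t)\, g_t g_t^T \nabla_m \log p(m_t)\Big]_i\Big\}. \tag{26}$$
Because
$$g_t \equiv \begin{bmatrix} 0 & 0 \\ 0 & g_t \end{bmatrix} \tag{27}$$
does not depend on the state, the divergence term vanishes and only the velocity channel is affected:
$$\frac{\partial p_t}{\partial t} = -\sum_{i}\frac{\partial}{\partial m_i}\Big\{f_i(m_t, t)\, p_t(m_t) - \frac{1}{2}\, p(m_t)\,\big[g_t^2 \nabla_v \log p(m_t)\big]_i\Big\}, \tag{28}$$
where $g_t^2 \nabla_v \log p$ is understood as the vector whose position block is zero. Then one obtains the equivalent ODE:
$$dm_t = \Big[f(m_t, t) - \frac{1}{2}\, g_t^2 \nabla_v \log p(m, t)\Big] dt . \tag{29}$$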

Section: D.5 DECOMPOSITION OF COVARIANCE MATRIX AND REPRESENTATION OF SCORE
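Here we follow the procedure in Dockhorn et al. (2021). Given the covariance matrix $\Sigma_t$, the Cholesky decomposition of this symmetric positive definite matrix is
$$\Sigma_t = L_t L_t^T, \tag{30}$$
where
$$L_t = \begin{bmatrix} L^{xx}_t & 0 \\ L^{xv}_t & L^{vv}_t \end{bmatrix} = \begin{bmatrix} \sqrt{\Sigma^{xx}_t} & 0 \\ \frac{\Sigma^{xv}_t}{\sqrt{\Sigma^{xx}_t}} & \sqrt{\frac{\Sigma^{xx}_t \Sigma^{vv}_t - (\Sigma^{xv}_t)^2}{\Sigma^{xx}_t}} \end{bmatrix}. \tag{31}$$
Borrowing results from Dockhorn et al. (2021), and writing $m_t = \mu_t + L_t \epsilon$, the score function reads
$$\begin{aligned}
\nabla_{m_t} \log p(m_t \mid m_1) &= -\nabla_{m_t}\, \frac{1}{2}(m_t - \mu_t)^T \Sigma_t^{-1}(m_t - \mu_t) \\
&= -\Sigma_t^{-1}(m_t - \mu_t) \\
&= -L_t^{-T} L_t^{-1}(m_t - \mu_t) \qquad \text{(Cholesky decomposition of } \Sigma_t\text{)} \\
&= -L_t^{-T} \epsilon .
\end{aligned}$$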
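The identity $-\Sigma_t^{-1}(m_t - \mu_t) = -L_t^{-T}\epsilon$ can be checked numerically. The sketch below is an illustration for a single data dimension (a 2x2 covariance) with arbitrary example values; the function names are hypothetical, and no numerical-stability offsets are included.

```python
import numpy as np

def cholesky_factor(Sxx, Sxv, Svv):
    """Lower-triangular L with Sigma = L L^T for a 2x2 phase-space covariance."""
    Lxx = np.sqrt(Sxx)
    Lxv = Sxv / Lxx
    Lvv = np.sqrt((Sxx * Svv - Sxv**2) / Sxx)
    return np.array([[Lxx, 0.0], [Lxv, Lvv]])

def gaussian_score(m, mu, Sigma):
    """Score of N(mu, Sigma) evaluated at m: -Sigma^{-1} (m - mu)."""
    return -np.linalg.solve(Sigma, m - mu)

def velocity_score_scale(Sxx, Sxv, Svv):
    """ell_t = sqrt(Sxx / (Sxx Svv - Sxv^2)), so that grad_v log p = -ell_t * eps_1."""
    return np.sqrt(Sxx / (Sxx * Svv - Sxv**2))
```

For a sample `m = mu + L @ eps`, `gaussian_score(m, mu, Sigma)` equals `-np.linalg.inv(L).T @ eps`, and its velocity component equals `-velocity_score_scale(...) * eps[1]`.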
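The form of $L_t$ reads
$$L_t = \begin{bmatrix} \sqrt{\Sigma^{xx}_t} & 0 \\ \frac{\Sigma^{xv}_t}{\sqrt{\Sigma^{xx}_t}} & \sqrt{\frac{\Sigma^{xx}_t \Sigma^{vv}_t - (\Sigma^{xv}_t)^2}{\Sigma^{xx}_t}} \end{bmatrix},$$
and the inverse transpose of $L_t$ reads
$$L_t^{-T} = \begin{bmatrix} \frac{1}{\sqrt{\Sigma^{xx}_t}} & \frac{-\Sigma^{xv}_t}{\sqrt{\Sigma^{xx}_t}\sqrt{\Sigma^{xx}_t \Sigma^{vv}_t - (\Sigma^{xv}_t)^2}} \\ 0 & \frac{\sqrt{\Sigma^{xx}_t}}{\sqrt{\Sigma^{xx}_t \Sigma^{vv}_t - (\Sigma^{xv}_t)^2}} \end{bmatrix}.$$
Hence, with the diagonal entries of $\Sigma_t$ offset by the constants $\epsilon_{xx}$ and $\epsilon_{vv}$, the score function with respect to the velocity reads
$$\nabla_v \log p(m_t \mid m_1) = -\underbrace{\sqrt{\frac{\Sigma^{xx}_t}{(\Sigma^{xx}_t + \epsilon_{xx})(\Sigma^{vv}_t + \epsilon_{vv}) - (\Sigma^{xv}_t)^2}}}_{\ell_t}\;\epsilon_1 .$$

Section: D.6 REPRESENTATION OF ACCELERATION a_t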
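As shown in Proposition 3, the optimal control can be represented as
$$\begin{aligned}
a_t^* &= g_t^2 P_{11}\left(\frac{x_1 - x_t}{1-t} - v_t\right) \\
&= g_t^2 P_{11}\frac{x_1}{1-t} - g_t^2 P_{11}\left(\frac{x_t}{1-t} + v_t\right) \\
&= g_t^2 P_{11}\frac{x_1}{1-t} - g_t^2 P_{11}\left(\frac{\mu^x_t + L^{xx}_t \epsilon_0}{1-t} + \mu^v_t + L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1\right) \\
&= g_t^2 P_{11}\left[\frac{x_1 - \mu^x_t}{1-t} - \mu^v_t - \left(\frac{L^{xx}_t}{1-t}\epsilon_0 + L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1\right)\right].
\end{aligned}$$
Solving the mean ODE of Appendix D.3, we get
$$\mu^x_t = \frac{1}{3}\, x_1 t^2 (t^2 - 4t + 6), \qquad \mu^v_t = \frac{4 t x_1}{3}(t^2 - 3t + 3).$$
Plugging in $\mu^x_t$ and $\mu^v_t$:
$$\begin{aligned}
a_t^* &= g_t^2 P_{11}\left[\frac{x_1 - \frac{1}{3} x_1 t^2 (6 - 4t + t^2)}{1-t} - \frac{4 t x_1}{3}(t^2 - 3t + 3) - \left(\frac{L^{xx}_t}{1-t}\epsilon_0 + L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1\right)\right] \\
&= g_t^2 P_{11}\left[\left(\frac{-t^4 + 4t^3 - 6t^2 + 3}{3(1-t)} - \frac{4t}{3}(t^2 - 3t + 3)\right) x_1 - \left(\frac{L^{xx}_t}{1-t}\epsilon_0 + L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1\right)\right] \\
&= g_t^2 P_{11}\left[\left(\frac{-(t-1)(t^3 - 3t^2 + 3t + 3)}{3(1-t)} - \frac{4t}{3}(t^2 - 3t + 3)\right) x_1 - \left(\frac{L^{xx}_t}{1-t}\epsilon_0 + L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1\right)\right] \\
&= g_t^2 P_{11}\left[\left(\frac{t^3 - 3t^2 + 3t + 3}{3} - \frac{1}{3}(4t^3 - 12t^2 + 12t)\right) x_1 - \left(\frac{L^{xx}_t}{1-t}\epsilon_0 + L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1\right)\right] \\
&= g_t^2 P_{11}\left[(1-t)^3 x_1 - \left(\frac{L^{xx}_t}{1-t}\epsilon_0 + L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1\right)\right] \\
&= 4(1-t)^2 x_1 - g_t^2 P_{11}\left(\frac{L^{xx}_t}{1-t}\epsilon_0 + L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1\right),
\end{aligned}$$
where the last step uses $g_t^2 P_{11} = \frac{4}{1-t}$.

Section: D.7 LOSS REWEIGHT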
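In practice, we use the following loss function:
$$\mathcal{L} = \min_\theta\; \mathbb{E}_{t \in [0,1]}\, \mathbb{E}_{x_1 \sim p_{\text{data}}}\, \mathbb{E}_{m_t \sim p_t(m_t \mid x_1)}\Big[\lambda(t)\, \big\|F^\theta_t(m_t, t; \theta) - F_t(m_t, t)\big\|_2^2\Big] \tag{32}$$
$$\propto \min_\theta\; \mathbb{E}_{t \in [0,1]}\, \mathbb{E}_{x_1 \sim p_{\text{data}}}\, \mathbb{E}_{m_t \sim p_t(m_t \mid x_1)}\Big[\frac{1}{1-t}\, \big\|F^\theta_t(m_t, t; \theta) - F_t(m_t, t)/z_t\big\|_2^2\Big]. \tag{33}$$
We acknowledge that this may not be an optimal choice. The motivation is simply to increase the training weight as $t \to 1$ and to normalize the label with the normalizer $z_t$.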
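A Monte Carlo estimate of the reweighted objective (33) can be sketched as below. This is only an illustration in NumPy under stated assumptions: `pred` stands for the network output on a batch, `target` for the regression label $F_t$, and `z` for the per-sample normalizer $z_t$; the function name and the array layout `(batch, dim)` are hypothetical.

```python
import numpy as np

def reweighted_loss(pred, target, t, z):
    """Batch estimate of E[ 1 / (1 - t) * || pred - target / z ||^2 ].

    pred, target: arrays of shape (batch, dim); t, z: arrays of shape (batch,).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    label = target / z[:, None]                      # normalized regression label
    per_sample = np.sum((pred - label) ** 2, axis=1)
    return float(np.mean(per_sample / (1.0 - t)))    # weight grows as t -> 1
```

Samples drawn close to `t = 1` contribute more to the average, which is the stated motivation for the weight `1 / (1 - t)`.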

Section: D.8 NORMALIZER OF AGM-SDE AND AGM-ODE
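The optimal control term can be represented as
$$a^*(m_t, t) = 4 x_1 (1-t)^2 - g_t^2 P_{11}\left[\left(\frac{L^{xx}_t}{1-t} + L^{xv}_t\right)\epsilon_0 + L^{vv}_t \epsilon_1\right].$$
We then introduce the normalizers
$$z_{\text{SDE}} = \sqrt{\big(4(1-t)^2 \cdot \sigma_{\text{data}}\big)^2 + \big(g_t^2 P_{11}\big)^2\left[\left(\frac{L^{xx}_t}{1-t} + L^{xv}_t\right)^2 + (L^{vv}_t)^2\right]},$$
$$z_{\text{ODE}} = \sqrt{\big(4(1-t)^2 \cdot \sigma_{\text{data}}\big)^2 + \left(g_t^2 P_{11}\left(\frac{L^{xx}_t}{1-t} + L^{xv}_t\right)\right)^2 + \left(g_t^2 P_{11} L^{vv}_t - \frac{1}{2} g_t^2 \ell_t\right)^2},$$
where $\ell_t := \sqrt{\frac{\Sigma^{xx}_t}{\Sigma^{xx}_t \Sigma^{vv}_t - (\Sigma^{xv}_t)^2}}$.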

Section: D.9 EXPONENTIAL INTEGRATOR DERIVATION
As suggested by Zhang & Chen (2022), one can write the discretized dynamics as
$$\begin{bmatrix} x_{t_{i+1}} \\ v_{t_{i+1}} \end{bmatrix} = \Phi(t_{i+1}, t_i)\begin{bmatrix} x_{t_i} \\ v_{t_i} \end{bmatrix} + \sum_{j=0}^{r} C_{i,j}\begin{bmatrix} 0 \\ s_\theta(m_{t_{i-j}}, t_{i-j}) \end{bmatrix},$$
where, with $t = t_i$ and $t + \delta_t = t_{i+1}$,
$$C_{i,j} = \int_{t}^{t+\delta_t} \Phi(t + \delta_t, \tau)\begin{bmatrix} 0 & 0 \\ 0 & z_\tau \end{bmatrix}\prod_{k \neq j}\frac{\tau - t_{i-k}}{t_{i-j} - t_{i-k}}\, d\tau, \qquad \Phi(t, s) = \begin{bmatrix} 1 & t - s \\ 0 & 1 \end{bmatrix}. \tag{34}$$
After plugging in the transition kernel $\Phi(t, s)$, one can easily obtain the results shown in (11).
Remark 14. For momentum systems, there are numerous methods for achieving high accuracy in their numerical solution. However, their practical performance in generative modeling remains untested. We recommend that readers consult classical numerical physics textbooks or recent momentum dynamics solvers (Pandey et al., 2023; Dockhorn et al., 2021).
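One step of the multistep update above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' sampler: the integrals are evaluated with Gauss-Legendre quadrature, `z` is a user-supplied vectorized function of time standing in for the normalizer $z_\tau$, and the names `lagrange_basis` and `ei_step` are hypothetical.

```python
import numpy as np

def lagrange_basis(tau, nodes, j):
    """prod_{k != j} (tau - t_{i-k}) / (t_{i-j} - t_{i-k}) with nodes[k] = t_{i-k}."""
    out = np.ones_like(tau)
    for k, t_k in enumerate(nodes):
        if k != j:
            out = out * (tau - t_k) / (nodes[j] - t_k)
    return out

def ei_step(x, v, t_cur, t_next, past_times, past_scores, z, n_quad=16):
    """One multistep exponential-integrator update of (x, v).

    past_times[j] = t_{i-j} and past_scores[j] = s_theta(m_{t_{i-j}}, t_{i-j});
    index j = 0 is the current step.
    """
    # Gauss-Legendre nodes and weights mapped from [-1, 1] to [t_cur, t_next]
    q, w = np.polynomial.legendre.leggauss(n_quad)
    tau = 0.5 * (t_next - t_cur) * q + 0.5 * (t_next + t_cur)
    w = 0.5 * (t_next - t_cur) * w
    # Phi(t_next, t_cur) applied to (x, v)
    x_new = x + (t_next - t_cur) * v
    v_new = np.array(v, dtype=float)
    for j, s_j in enumerate(past_scores):
        basis = z(tau) * lagrange_basis(tau, past_times, j)
        # Phi(t_next, tau) maps a velocity increment to (t_next - tau) in position
        x_new = x_new + np.sum(w * (t_next - tau) * basis) * s_j
        v_new = v_new + np.sum(w * basis) * s_j
    return x_new, v_new
```

With a single past evaluation (`len(past_scores) == 1`) the score is held constant over the step; with two evaluations it is extrapolated linearly in time.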

Section: D.10 PROOF OF PROPOSITION.5
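The estimated data point $x_1$ can be represented as
$$x_1^{\text{SDE}} = (1-t)\left(\frac{F^\theta_t}{g_t^2 P_{11}} + v_t\right) + x_t, \quad \text{or} \quad x_1^{\text{ODE}} = \frac{F^\theta_t + g_t^2 P_{11}(\alpha_t x_t + \beta_t v_t)}{4(t-1)^2 + g_t^2 P_{11}(\alpha_t \mu^x_t + \beta_t \mu^v_t)} \tag{35}$$
for the SDE and probabilistic ODE dynamics respectively, where
$$\beta_t = L^{vv}_t + \frac{1}{2 P_{11}}, \qquad \alpha_t = \frac{\left(\frac{L^{xx}_t}{1-t} + L^{xv}_t\right) - \beta_t L^{xv}_t}{L^{xx}_t}.$$
Proof. The representation of $x_1$ for the SDE follows directly from the fact that the network is essentially estimating
$$F^\theta_t \approx g_t^2 P_{11}\left(\frac{x_1 - x_t}{1-t} - v_t\right) \;\Longleftrightarrow\; x_1 \approx (1-t)\left(\frac{F^\theta_t}{g_t^2 P_{11}} + v_t\right) + x_t .$$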
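The SDE case can be illustrated with a short round-trip check, obtained by solving the regression target $F \approx g_t^2 P_{11}\big(\frac{x_1 - x_t}{1-t} - v_t\big)$ for $x_1$. This is a sketch under stated assumptions: a scalar state, $g_t^2 P_{11} = 4/(1-t)$ from Proposition 8, and an exact force in place of the network output. The function names are hypothetical.

```python
def g2_P11(t):
    """g_t^2 * P_11 = 4 / (1 - t), using P_11 = -4 / (g_t^2 (t - 1))."""
    return 4.0 / (1.0 - t)

def force_sde(x_t, v_t, x_1, t):
    """Regression target F = g_t^2 P_11 ((x_1 - x_t) / (1 - t) - v_t)."""
    return g2_P11(t) * ((x_1 - x_t) / (1.0 - t) - v_t)

def predict_x1_sde(F, x_t, v_t, t):
    """Data estimate obtained by solving the target relation for x_1."""
    return (1.0 - t) * (F / g2_P11(t) + v_t) + x_t
```

Feeding the exact force back into `predict_x1_sde` returns `x_1` at any `t < 1`, which is why a data estimate is available early along the trajectory once a force prediction is at hand.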
The probabilistic ODE case is slightly more complicated. We notice that
$$m_t = \mu_t + L\epsilon \;\Longleftrightarrow\; x_t = \mu^x_t + L^{xx}_t \epsilon_0, \quad v_t = \mu^v_t + L^{xv}_t \epsilon_0 + L^{vv}_t \epsilon_1 .$$
In the probabilistic ODE case, the force term can be represented as
$$F(m_t, t) = 4 x_1 (1-t)^2 - g_t^2 P_{11}\left[\left(\frac{L^{xx}_t}{1-t} + L^{xv}_t\right)\epsilon_0 + L^{vv}_t \epsilon_1\right] + \frac{1}{2} g_t^2 \ell_t \epsilon_1 .$$

Sampling: For the Exponential Integrator, we choose the multistep order $w = 2$ consistently for all experiments. Different from previous work (Dockhorn et al., 2021; Karras et al., 2022; Zhang et al., 2023), we use a quadratic timestep scheme with $\kappa = 2$:
$$t_i = \left(\frac{N - i}{N}\, t_0^{\frac{1}{\kappa}} + \frac{i}{N}\, t_N^{\frac{1}{\kappa}}\right)^{\kappa},$$
which is the opposite of the classical DM schedule: the time step grows as the dynamics is propagated closer to the data. For numerical stability, we use $t_0 = 10^{-5}$ for all experiments.
For NFE = 5 we use $t_N = 0.5$, and for NFE = 10 we use $t_N = 0.7$. For the remaining settings, we use $t_N = 0.999$.
Because EDM (Karras et al., 2022) uses a second-order ODE solver, in practice we allow it one extra NFE beyond the value reported in the tables.
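The quadratic timestep scheme described above can be written in a few lines. The sketch below is an illustration; the function name `quadratic_timesteps` and its default arguments are assumptions taken from the values quoted in the text ($t_0 = 10^{-5}$, $t_N = 0.999$, $\kappa = 2$).

```python
import numpy as np

def quadratic_timesteps(n, t_0=1e-5, t_n=0.999, kappa=2.0):
    """t_i = ((N - i)/N * t_0^(1/kappa) + i/N * t_N^(1/kappa))^kappa for i = 0..N."""
    i = np.arange(n + 1)
    return ((n - i) / n * t_0 ** (1.0 / kappa) + i / n * t_n ** (1.0 / kappa)) ** kappa
```

For `kappa = 2` the grid is the square of an evenly spaced grid, so consecutive step sizes increase towards `t_n`, that is, towards the data.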

Section: E.1 CODE EXAMPLE FOR COVARIANCE
We slightly abuse notation in this coding section. Here we provide example code for computing the covariance matrix. We consider the general case where $\Sigma_0 := \begin{bmatrix} m & -k\sqrt{mn} \\ -k\sqrt{mn} & n \end{bmatrix}$ and the diffusion coefficient is $g(t) := p\,(tt - t)$, where $p$ is the scaling coefficient and $tt$ is the damping coefficient.
def 

Section: G ABLATION STUDY OF STROKE-BASED CONDITIONAL GENERATION
In order to investigate the diversity and faithfulness of stroke-based conditional generation, we conduct an ablation study with respect to the hyperparameter ξ.
Figure 7: Ablation study for stroke-based conditional generation. When ξ = 0, generation is unconditional. Notably, the diversity of the generated samples decays as ξ increases. To balance faithfulness and diversity, one needs to tune the hyperparameter ξ.

Section: H ADDITIONAL FIGURES
We demonstrate the samples for different datasets with varying NFE. 

Section: A SUPPLEMENTARY SUMMARY
We state the assumptions in Appendix B. We provide the technical details of Section 3 in Appendix D. The details of the experiments can be found in Appendix E. Visualizations of generated samples can be found in Appendix H.

Section: B ASSUMPTIONS
We use the following assumptions to construct the proposed method. These assumptions are adopted from the stochastic analysis of SGM (Song et al., 2021; Yong & Zhou, 1999; Anderson, 1982): (i) $p_0$ and $p_1$ have finite second-order moments. (ii) $g_t$ is a continuous function, and $|g(t)|^2 > 0$ is uniformly lower-bounded w.r.t. $t$. (iii) For all $t \in [0, 1]$, $\nabla_v \log p_t(m_t, t)$ is Lipschitz and has at most linear growth w.r.t. $x$ and $v$.
Assumptions (i) and (ii) are standard conditions in stochastic analysis that ensure existence and uniqueness of solutions to the SDEs; hence they also appear in SGM analysis (Song et al., 2021).

Section: C STOCHASTIC OPTIMAL CONTROL (SOC) IN THE WILD
In this section, we provide a gentle introduction to Stochastic Optimal Control (SOC). Our work relies heavily on the prior work of Chen & Georgiou (2015), in which some technical details are omitted. Here we first clarify some core derivations that may help a broader audience understand Chen & Georgiou (2015) and our work.

Section: C.1 LINEAR QUADRATIC STOCHASTIC OPTIMAL CONTROL
SOC has wide applications in finance, robotics, and manufacturing. Here we focus on Linear Quadratic SOC, which usually refers to the Linear Quadratic Regulator because the dynamics is linear and the objective function is quadratic (Bryson, 1975; Stengel, 1994). The problem is stated as:
In this formulation, $x_t$ is the state and $u_t$ is the control variable. Conceptually, the SOC problem aims to design the controller $u_t$ that drives the system from the point $x_0$ to $x_1 \equiv 0$ with minimum effort. For a first-order system, the control is the optimal vector field $v_t^*$, and for a second-order system, the control is the optimal acceleration $a_t^*$. The stochasticity introduced by the Wiener process $dw_t$ prevents the system from converging precisely to the Dirac mass at $x_1$. To strike a balance between converging to $x_1$ and minimizing the overall control effort $\int \|u_t\|_2^2\, dt$, the terminal cost $x_1^T R x_1$ is imposed. One special case is $R \to \infty$. Intuitively, it means that the controlled dynamics should converge precisely to $x_1$. However, the stochastic trajectory that connects $x_0$ and $x_1$ is not unique under this constraint alone. Given the constraint (the path is pinned at $x_0$ and $x_1$ at the two boundaries), the SOC problem selects the solution with minimum effort $u_t$, which can be understood as a regularization of the trajectories; with this regularization of the controller, the stochastic trajectory becomes unique. One can also connect such a pinned-down SDE with the well-known Doob h-transform. For readers who are not familiar with these topics, some interesting references are Heng et al. (2021) and O'Connell (2003).
The classical procedure for solving the SOC problem is:
1. write down the Hamilton-Jacobi-Bellman equation ( 

Section: F CONDITIONAL GENERATION DETAILS
Here we provide the details of conditional generation.

Section: F.1 STROKE BASED GENERATION
For stroke based generation, we provide two types of conditional generation.
Initial Velocity (IV): Please refer to Section 4. Dynamics Velocity (dyn-V): Since the mean and variance of the velocity and position are available, one can specify a valid velocity. In this case, we can set the velocity as
In which,
when t ≤ c, where c is the guidance length. We typically set c = 0.25.

Section: F.2 INPAINTING
In the inpainting case, we apply a strategy similar to dyn-V. Specifically, in this case, $x_1$ is represented as:
where MASK denotes the mask matrix that zeroes out pixels of the original image. This $x_1$ serves as the source for estimating $\mu^x_t$ in eq. 37.

Section: F.3 INPAINTING BASED GENERATION
For stroke based generation, we provide two types of conditional generation. 


References:
[b0] Brian D. O. Anderson (1982). Reverse-time diffusion equation models. Stochastic Processes and their Applications
[b1] Fan Bao; Chongxuan Li; Jun Zhu; Bo Zhang (2022). Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. 
[b2] Arthur Earl Bryson (1975). Applied optimal control: optimization, estimation and control. CRC Press
[b3] Tianrong Chen; Guan-Horng Liu; Molei Tao; Evangelos A Theodorou (2023). Deep momentum multi-marginal Schrödinger bridge. 
[b4] Yongxin Chen; Tryphon Georgiou (2015). Stochastic bridges of linear systems. IEEE Transactions on Automatic Control
[b5] Yunjey Choi; Youngjung Uh; Jaejun Yoo; Jung-Woo Ha (2020). Stargan v2: Diverse image synthesis for multiple domains. 
[b6] Valentin De Bortoli; Guan-Horng Liu; Tianrong Chen; Evangelos A Theodorou; Weili Nie (2023). Augmented bridge matching. 
[b7] Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Li Fei-Fei (2009). Imagenet: A large-scale hierarchical image database. IEEE
[b8] Prafulla Dhariwal; Alex Nichol (2021). Diffusion models beat gans on image synthesis. 
[b9] Tim Dockhorn; Arash Vahdat; Karsten Kreis (2021). Score-based generative modeling with critically-damped Langevin diffusion. 
[b10] Ulrich G Haussmann; Etienne Pardoux (1986). Time reversal of diffusions. The Annals of Probability
[b11] Jeremy Heng; Valentin De Bortoli; Arnaud Doucet; James Thornton (2021). Simulating diffusion bridges with score matching. 
[b12] Martin Heusel; Hubert Ramsauer; Thomas Unterthiner; Bernhard Nessler; Sepp Hochreiter (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems
[b13] Jonathan Ho; Ajay Jain; Pieter Abbeel (2020). Denoising diffusion probabilistic models. 
[b14] Marlis Hochbruck; Alexander Ostermann (2010). Exponential integrators. Acta Numerica
[b15]  (2008). Stochastic optimal control theory. ICML
[b16] Tero Karras; Miika Aittala; Timo Aila; Samuli Laine (2022). Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems
[b17] Donald E Kirk (2004). Optimal control theory: an introduction. Courier Corporation
[b18] Peter E Kloeden; Eckhard Platen (1992). Stochastic differential equations. Springer
[b19] Alex Krizhevsky; Geoffrey Hinton (2009). Learning multiple layers of features from tiny images. 
[b20] Christian Léonard; Sylvie Roelly; Jean-Claude Zambrini (2014). Reciprocal processes. A measure-theoretical point of view. 
[b21] Yaron Lipman; Ricky T. Q. Chen; Heli Ben-Hamu; Maximilian Nickel; Matt Le (2022). Flow matching for generative modeling. 
[b22] Guan-Horng Liu; Arash Vahdat; De-An Huang; Evangelos A Theodorou; Weili Nie; Anima Anandkumar (2023). I2SB: Image-to-image Schrödinger bridge. 
[b23] Xingchao Liu; Lemeng Wu; Mao Ye; Qiang Liu (2022). Let us build bridges: Understanding and extending diffusion generative models. 
[b24] Ilya Loshchilov; Frank Hutter (2017). Decoupled weight decay regularization. 
[b25] Neil O'Connell (2003). Conditioned random walks and the RSK correspondence. Journal of Physics A: Mathematical and General
[b26] Bernt Øksendal (2003). Stochastic differential equations. Springer
[b27] Kushagra Pandey; Maja Rudolph; Stephan Mandt (2023). Efficient integrators for diffusion generative models. 
[b28] Stefano Peluchetti (2021). Non-denoising forward-time diffusions. 
[b29] Stefano Peluchetti (2023). Diffusion bridge mixture transports, Schrödinger bridge problems and generative modeling. 
[b30] Aram-Alexandre Pooladian; Heli Ben-Hamu; Carles Domingo-Enrich; Brandon Amos; Yaron Lipman; Ricky Chen (2023). Multisample flow matching: Straightening flows with minibatch couplings. 
[b31] Daniel Revuz; Marc Yor (2013). Continuous martingales and Brownian motion. Springer Science & Business Media
[b32] Simo Särkkä; Arno Solin (2019). Applied stochastic differential equations. Cambridge University Press
[b33] Yuyang Shi; Valentin De Bortoli; George Deligiannidis; Arnaud Doucet (2022). Conditional simulation using diffusion schrödinger bridges. PMLR
[b34] Yuyang Shi; Valentin De Bortoli; Andrew Campbell; Arnaud Doucet (2023). Diffusion Schrödinger bridge matching. 
[b35] Vignesh Ram Somnath; Matteo Pariset; Ya-Ping Hsieh; Maria Rodriguez Martinez; Andreas Krause; Charlotte Bunne (2023). Aligned diffusion Schrödinger bridges. 
[b36] Jiaming Song; Chenlin Meng; Stefano Ermon (2020). Denoising diffusion implicit models. 
[b37] Yang Song; Jascha Sohl-Dickstein; Diederik P Kingma; Abhishek Kumar; Stefano Ermon; Ben Poole (2020). Score-based generative modeling through stochastic differential equations. 
[b38] Yang Song; Conor Durkan; Iain Murray; Stefano Ermon (2021). Maximum likelihood training of score-based diffusion models. 
[b39] Robert F Stengel (1994). Optimal control and estimation. Courier Corporation
[b40] Alexander Tong; Nikolay Malkin; Kilian Fatras; Lazar Atanackovic; Yanlei Zhang; Guillaume Huguet; Guy Wolf; Yoshua Bengio (2023). Simulation-free Schrödinger bridges via score and flow matching. 
[b41] Alexander Tong; Nikolay Malkin; Guillaume Huguet; Yanlei Zhang; Jarrid Rector-Brooks; Kilian Fatras; Guy Wolf; Yoshua Bengio (2023). Improving and generalizing flow-based generative models with minibatch optimal transport. 
[b42] Jiongmin Yong; Xun Yu Zhou (1999). Stochastic controls: Hamiltonian systems and HJB equations. Springer Science & Business Media
[b43] Qinsheng Zhang; Yongxin Chen (2022). Fast sampling of diffusion models with exponential integrator. 
[b44] Qinsheng Zhang; Molei Tao; Yongxin Chen (2022). gddim: Generalized denoising diffusion implicit models. 
[b45] Qinsheng Zhang; Jiaming Song; Yongxin Chen (2023). Improved order analysis and design of exponential integrator for diffusion models sampling. 

Figures:
Figure fig_0: 
Type: figure
Caption: 1: Input: trained F(•, •; θ), discretized time step [t 0 ,• • • ,t i ], Choose the sampler from [SSS(SDE), EI(ODE)]. Choose prior mean and covariance µ 0 , Σ 0 2: Sample m 0 ∼ p 0 (m; µ 0 , Σ 0 ). 3: for n = 0 to i do 4: estimate F θ tn (m tn , t n ) 5:m tn+1 = Sampler(m tn , F θ tn , t n )
Data: 

Figure fig_1: 4
Type: figure
Caption: Figure 4: Comparison with EDM (Karras et al., 2022) on the AFHQv2 dataset. AGM-ODE exhibits superior generative performance when NFE is exceedingly low, owing to its unique dynamics architecture that incorporates velocity when predicting the estimated data point.
Data: 

Figure fig_2: 5
Type: figure
Caption: Figure 5: We showcase that AGM can generate conditional results from an unconditional model by injecting the conditional information into the velocity v_0, thus leading to a new initial velocity v_0^cond.
Data: 

Figure fig_3: 
Type: figure
Caption: 2 De Bortoli et al. (2023). So in our case, we are not solving the momentum Schrödinger Bridge as shown in Chen et al. (2023) (see also Fig. 6), even though the problem formulation is similar. Specifically, AGM is a special case of the momentum Schrödinger Bridge in which the boundary conditions degenerate to Dirac distributions.
Data: 

Figure fig_4: 6
Type: figure
Caption: Figure 6: Momentum Schrödinger Bridge versus AGM.
Data: 

Figure fig_5: 8
Type: figure
Caption: Figure 8: Comparison with CLD (Dockhorn et al., 2021) using the same network and the stochastic sampler SSS, for the Multi-Swiss-Roll and Mixture of Gaussian datasets. We achieve visually better results with one order of magnitude fewer NFEs.
Data: 

Figure tab_0: 
Type: table
Caption: This property empowers us to allocate the NFE budget selectively within the time interval t ∈ [0, t i ], where t i < t N , effectively reducing the discretization error while maintaining the sampling quality. This insight paves the way for efficient low NFE sampling strategies later. Here we summarized the training and sampling procedure of our method in Algorithm.1 and Algorithm.2 respectively.
Data: t L xxβtL xv t.Proof. See Appendix.D.10Algorithm 1 Training

Figure tab_1: 
Type: table
Caption: Table 1: Comparison of generative models based on their initial (p₀) and terminal (p₁) boundary distributions. Our AGM, operating in phase space, generalizes beyond standard Diffusion Models (DM) by not requiring convergence to a simple Gaussian prior at equilibrium, which often leads to curved trajectories in methods like CLD (see Fig. 1). Instead, AGM's terminal velocity distribution is designed as a convolution of the data distribution with a Gaussian, facilitating straighter and more efficient paths.
Data: Models | DM/FM | CLD | AGM (ours)
p_0(·) | p_data(x) | p_data(x) × N(0, I_d) | N(0, Σ_0 × I_2d)
p_1(·) | N(0, I_d) | N(0, I_d) × N(0, I_d) | p_data(x) × p_data(x) * N(0, Σ_1 ⊗ I_2d)

Figure tab_1: 2
Type: table
Caption: Table 2: FID↓ comparison with CLD (Dockhorn et al., 2021) using the same SSS sampler on CIFAR-10.
Data: NFE↓ | CLD-SDE | AGM-SDE
20 | >100 | 7.9
50 | 19.93 | 3.21
150 | 2.99 | 2.68
1000 | 2.44 | 2.46

Figure tab_2: 3
Type: table
Caption: Unconditional CIFAR-10 generative performance
Data: Table 4: Unconditional ImageNet-64generative performanceModel NameNFE↓ FID↓ModelNFE↓ FID↓ODEEDM (Karras et al., 2022) CLD+EI (Zhang et al., 2022) 50 35 FM-OT (Lipman et al., 2022) 142 6.35 1.84 2.26FM-OT(Lipman et al., 2022) 138 14.45 MFM(Pooladian et al., 2023) 132 11.82AGM-ODE(ours)502.46MFM(Pooladian et al., 2023) 4012.97SDEVP (Song et al., 2020b) VE (Song et al., 2020b) CLD (Dockhorn et al., 2021) 1000 2.44 1000 2.66 1000 2.43AGM-ODE(ours) AGM-ODE(ours)40 3010.10 10.07AGM-SDE(ours)1000 2.46AGM-ODE(ours)2010.55Table 5: Performance comparing with fast sampling algorithm using FID↓ metric on CIFAR-10NFE↓ 51020Dynamics Order Model NameEDM (Karras et al., 2022)> 100 15.78 2.231st orderVP+EI (Zhang & Chen, 2022)15.374.173.03dynamicsDDIM (Song et al., 2020a)26.9111.14 3.50Analytic-DPM(Bao et al., 2022)51.4714.06 6.742nd orderCLD+EI (Zhang et al., 2022)N/A13.41 3.39dynamicsAGM-ODE(ours)11.934.602.60

Figure tab_3: 
Type: table
Caption: In order to use a linear combination of $x_t$ and $v_t$ to represent $F$, one needs to match the stochastic term in $F_t$, namely $g_t^2 P_{11}\big[\big(\frac{L^{xx}_t}{1-t} + L^{xv}_t\big)\epsilon_0 + L^{vv}_t \epsilon_1\big] - \frac{1}{2} g_t^2 \ell \epsilon_1$, by using $\alpha_t L^{xx}_t + \beta_t L^{xv}_t = \zeta_t$ with $\zeta_t := \frac{L^{xx}_t}{1-t} + L^{xv}_t$, and $\beta_t L^{vv}_t = L^{vv}_t + \frac{1}{2 P_{11}}$. The solution is obtained by solving the second condition for $\beta_t$ and then setting $\alpha_t = \frac{\zeta_t - \beta_t L^{xv}_t}{L^{xx}_t}$. Substituting back into $F_t$, one gets: $F(m_t, t) = 4 x_1 (1-t)^2 - g_t^2 P_{11}\big[\alpha_t (x_t - \mu^x_t) + \beta_t (v_t - \mu^v_t)\big] = \big[4(1-t)^2 + g_t^2 P_{11}(\alpha_t \mu^x_t + \beta_t \mu^v_t)\big] x_1 - g_t^2 P_{11}\big[\alpha_t x_t + \beta_t v_t\big] \;\Leftrightarrow\; x_1 = \frac{F^\theta_t + g_t^2 P_{11}(\alpha_t x_t + \beta_t v_t)}{4(t-1)^2 + g_t^2 P_{11}(\alpha_t \mu^x_t + \beta_t \mu^v_t)}$. E EXPERIMENTAL DETAILS. We stick with the hyperparameters introduced in Section 4. We use AdamW (Loshchilov & Hutter, 2017) as our optimizer and Exponential Moving Averaging with an exponential decay rate of 0.9999. We use 8 × Nvidia A100 GPUs for all experiments. For further details of the training setup, please refer to Table 6.
Data: Table 6: Additional experimental details
dataset | Training Iter | Learning rate | Batch Size | network architecture
toy | 0.05M | 1e-3 | 1024 | ResNet (Dockhorn et al., 2021)
CIFAR-10 | 0.5M | 1e-3 | 512 | NCSN++ (Karras et al., 2022)
AFHQv2 | 0.5M | 1e-3 | 512 | NCSN++ (Karras et al., 2022)
ImageNet-64 | 1.6M | 2e-4 | 512 | ADM (Dhariwal & Nichol, 2021)


Formulas:
Formula formula_0: p 0 (•) p data (x) p data (x) × N (0, I d ) N (0, Σ 0 × I 2d ) p 1 (•) N (0, I d ) N (0, I d ) × N (0, I d ) p data (x) × p data (x) * N (0, Σ 1 ⊗ I 2d )

Formula formula_1: dx t = f t (x t )dt + g(t)dw t x 0 ∼ p data (x)(1)

Formula formula_2: dx t = f t (x t ) -g 2 t ∇ x log p(x t , t) dt + g(t)dw t , x 1 ∼ N (0, I d )(2)

Formula formula_3: dx t = f t (x t ) - 1 2 g 2 t ∇ x log p(x t , t) dt, x 1 ∼ N (0, I d )(3)

Formula formula_4: dx t = v t (x, t)dt + g t dw t s.t. (x 0 , x 1 ) ∼ Π 0,1 (x 0 , x 1 ) := p 0 × p 1 (4)

Formula formula_5: min at 1 τ ∥a t ∥ 2 2 dt + (m 1 -m 1 ) T R(m 1 -m 1 ) s.t dx t dv t dmt = v t a t (x t , v t , t) f (m,t) dt + 0 0 0 g t gt dw t , m τ := x τ v τ = x τ v τ , R = r 0 0 r ⊗ I d , x 1 ∼ p data .(5)

Formula formula_6: a * (m t , t) = g 2 t P 11 x 1 -x t 1 -t -v t where : P 11 = -4 g 2 t (t -1) . (6

Formula formula_7: )

Formula formula_8: dx t dv t = v t F t dt + 0 0 0 h t dw t s.t m 0 := x 0 v 0 ∼ N (µ 0 , Σ 0 ), Bridge Matching SDE : F t := F b t (m t , t) ≡ a * t (m t , t), h(t) := g(t), Probablistic ODE : F t := F p t (m t , t) ≡ a * t (m t , t) - 1 2 g 2 t ∇ v log p(m, t), h(t) := 0.(7)

Formula formula_9: p t (m t |x 1 ) = N (µ t , Σ t )

Formula formula_10: Σ t = Σ xx t Σ xv t Σ xv t Σ vv t ⊗ I d , and µ t = µ x t µ v t

Formula formula_11: m t = µ t + L t ϵ = µ t + L xx t ϵ 0 L xv t ϵ 0 + L vv t ϵ 1 , ∇ v log p t := -ℓ t ϵ 1 (8)

Formula formula_12: Σ t = L t L T t ,ϵ = ϵ 0 ϵ 1 ∼ N (0, I 2d ) and ℓ t = Σ xx t Σ xx t Σ vv t -(Σ xv t ) 2 .

Formula formula_13: a * (m t , t) = 4x 1 (1 -t) 2 -g 2 t P 11 L xx t 1 -t + L xv t ϵ 0 + L vv t ϵ 1 .(9)

Formula formula_14: F θ t = s θ t • z t .

Formula formula_15: min θ E t∈[0,1] E x1∼p data E mt∼pt(mt|x1) λ(t) ∥F θ t (m t , t; θ) -F t (m t , t)∥ 2 2 (10)

Formula formula_16: x ti+1 v ti+1 = Φ(t i+1 , t i ) x t v t + w j=0 ti+1 ti (t i+1 -τ ) z τ • M i,j (τ )dτ • s θ t (m ti-j , t i-j )) ti+1 ti z τ • M i,j (τ )dτ • s θ t (m ti-j , t i-j ) Where M i,j (τ ) = k̸ =j τ -t i-k t i-j -t i-k

Formula formula_17: Φ(t, s) = 1 t -s 0 1 .(11)

Formula formula_18: xSDE 1 = (1 -t)(F θ t + v t ) g 2 t P 11 + x t , or xODE 1 = F θ t + g 2 t P 11 (α t x t + β t v t ) 4(t -1) 2 + g 2 t P 11 (α t µ x t + β t µ v t )(12)

Formula formula_19: β t = L vv t + 1 2P11 ,α t = ( L xx t 1-t +L xv t )-

Formula formula_20: V (x t , t) = inf u E 1 t 1 2 ∥u t ∥ 2 2 dτ + x T 1 Rx 1

Formula formula_21: dx t = (Ax t + g t u t )dt + g t dw t

Formula formula_22: V (t, x t ) = inf u E V (t + dt, x t+dt ) + t+dt t 1 2 ∥u t ∥ 2 2 dτ = inf u E 1 2 ∥u t ∥ 2 2 dt + V (t, x t ) + V t (t, x t )dt + V x (t, x)dx + 1 2 tr V xx gg T dt = Plug in the dynamics dx t = • • • = inf u E 1 2 ∥u t ∥ 2 2 dt + V (t, x t ) + V t (t, x t )dt + V x (t, x) T ((Ax t + g t u t )dt + gdw t ) + 1 2 tr V xx gg T dt = inf u 1 2 ∥u t ∥ 2 2 dt + V (t, x t ) + V t (t, x t )dt + V x (t, x) T (Ax t + g t u t )dt + 1 2 tr V xx gg T dt

Formula formula_23: V t + inf u 1 2 ∥u t ∥ 2 2 + V T x (Ax t + g t u t ) + 1 2 tr V xx gg T = 0

Formula formula_24: u * t = -g t V x

Formula formula_25: V t - 1 2 V x gg T V x + V T x Ax t + 1 2 tr V xx gg T = 0

Formula formula_26: -Ξ - 1 2 x T Qx = - 1 2 x T Qgg T Qx T + x T A T Qx + 1 2 tr Qgg T (14)

Formula formula_27: Ξ(1) = 0, Q(1) = R (15)

Formula formula_28: -Q = A T Q + QA -Qgg T Q (16)

Formula formula_29: u * t = -g T Q(t)x t .

Formula formula_30: Ṗ = AP + P A T -gg T (17)

Formula formula_31: Q -1 = AQ -1 + Q -1 A T -gg T ⇔ -Q -1 QQ -1 = AQ -1 + Q -1 A T -gg T ⇔ -Q = QA + A T Q -Qgg T Q

Formula formula_32: u * t = -g T P (t) -1

Formula formula_33: min ut 1 t 1 2 ∥u t ∥ 2 2 dt + r 2 ∥x 1 -x 1 ∥ 2 2 s.t. dx t = u t dt, x 0 = x 0

Formula formula_34: H(t, x, u, γ) = - 1 2 ∥u t ∥ 2 2 + γu t

Formula formula_35: ∂H ∂u t = 0,

Formula formula_36: H(t, x, u, γ) * = 1 2 γ 2 , where u t = γ

Formula formula_37: dx t dt = ∂H * ∂γ = γ dγ dt = ∂H * ∂x = 0 where x 0 = x 0 and γ 1 = -r • (x 1 -x 1 )

Formula formula_38: γ t = γ = -r • (x 1 -x 1 ), hence the solution for x t is x t = x 1 + γt. γ = -r(x 1 -x 1 ) = -r(x 0 + (1 -t)γ -x 1 ) → u * t := γ = r(x 1 -x 0 ) 1 + r(1 -t)

Formula formula_39: dx t = x 1 -x t 1 -t dt + dw t Remark 7.

Formula formula_40: a * (m t , t) = g 2 t P 11 x 1 -x t 1 -t -v t where : P 11 = -4 g 2 t (t -1) . (18

Formula formula_41: )

Formula formula_42: u * t = -gg T P -1 t (m t -Φ(t, 1)m 1 )

Formula formula_43: u * t = -gg T P -1 t (m t -Φ(t, 1)m 1 ) = -gg T P -1 t m t + gg T P -1 t Φ(t, 1)m 1 = - 0 0 0 g 2 P -1 t m t + gg T P -1 t 1 t -1 0 1 m 1 = -g 2 t 0 0 P 10 P 11 m t + 0 0 0 g 2 t P 00 P 01 P 10 P 11 1 t -1 0 1 m 1 = -g 2 t 0 0 P 10 P 11 m t + g 2 t 0 0 P 10 P 11 1 t -1 0 1 m 1 = -g 2 t 0 0 P 10 P 11 m t + g 2 t 0 0 P 10 P 10 (t -1) + P 11 m 1 = 0 g 2 t P 10 (x 1 -x t ) + g 2 t P 10 (t -1) • v 1 + g 2 t P 11 (v 1 -v t ) Plug in v 1 := x 1 -x t 1 -t = 0 g 2 t P 11 x1-xt 1-t -v t

Formula formula_44: u * t ∈ arg min ut∈U E T 0 1 2 ∥u t ∥ 2 dt + x T 1 Rx 1 s.t dm t = 0 1 0 0 A m t dt + u t dt + gdw t m 0 = m 0 , m 1 = m 1 is depited as Ṗ = AP + PA T -gg T . (19

Formula formula_45: )

Formula formula_46: P 1 = R -1 = lim r→inf r 0 0 r -1 = 0 0 0 0(20)

Formula formula_47: $$P_1 = R^{-1} = \begin{bmatrix} 0 & 0 \\ 0 & \omega \end{bmatrix} \tag{21}$$

Formula formula_48: $$P_t = \begin{bmatrix} \omega(t-1)^2 - \frac{1}{3} g^2 (t-1)^3 & \omega(t-1) - \frac{1}{2} g^2 (t-1)^2 \\ \omega(t-1) - \frac{1}{2} g^2 (t-1)^2 & \omega + g^2 (1-t) \end{bmatrix}$$

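Formula formula_49: $$P_t^{-1} = \frac{1}{g^2\left(-4\omega + g^2(t-1)\right)(t-1)} \begin{bmatrix} \dfrac{12\left(\omega - g^2(t-1)\right)}{(t-1)^2} & \dfrac{6\left(-2\omega + g^2(t-1)\right)}{t-1} \\ \dfrac{6\left(-2\omega + g^2(t-1)\right)}{t-1} & 12\omega - 4g^2(t-1) \end{bmatrix}$$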

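Formula formula_50: $$\begin{aligned}
P_{10} &= \frac{-12\omega + 6g^2(t-1)}{g^2\left[-4\omega + g^2(t-1)\right](t-1)^2} = \frac{-12\omega}{g^2\left[-4\omega + g^2(t-1)\right](t-1)^2} + \frac{6}{\left[-4\omega + g^2(t-1)\right](t-1)} \\
P_{11} &= \frac{12\omega - 4g^2(t-1)}{g^2\left[-4\omega + g^2(t-1)\right](t-1)} = \frac{12\omega}{g^2\left[-4\omega + g^2(t-1)\right](t-1)} + \frac{-4}{\left[-4\omega + g^2(t-1)\right]}
\end{aligned}$$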
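
A short numerical check can make the relationship between P_t, its inverse, and the Lyapunov equation concrete. The sketch below is illustrative only: the function names and numeric values are hypothetical, g is treated as constant, and the lower-right entry of P_t is taken as ω + g²(1 - t) so that the terminal condition P_1 = diag(0, ω) holds (an assumption of this sketch). It verifies that P_t multiplied by the closed-form inverse gives the identity, that P_t satisfies dP/dt = AP + PA^T - gg^T by central differences, and that P_11 tends to -4 / (g²(t - 1)) as ω goes to 0.

```python
def lyapunov_solution(t, g, omega):
    """Entries (a, b, c) of the symmetric P_t = [[a, b], [b, c]]."""
    s = t - 1.0
    a = omega * s**2 - g**2 * s**3 / 3.0
    b = omega * s - g**2 * s**2 / 2.0
    c = omega - g**2 * s
    return a, b, c


def inverse_closed_form(t, g, omega):
    """Entries (P00, P10, P11) of P_t^{-1} from the closed-form expression."""
    s = t - 1.0
    pref = 1.0 / (g**2 * (-4.0 * omega + g**2 * s) * s)
    p00 = pref * 12.0 * (omega - g**2 * s) / s**2
    p10 = pref * 6.0 * (-2.0 * omega + g**2 * s) / s
    p11 = pref * (12.0 * omega - 4.0 * g**2 * s)
    return p00, p10, p11


# Hypothetical values, chosen only for illustration.
t, g, omega, h = 0.3, 1.5, 0.2, 1e-6
a, b, c = lyapunov_solution(t, g, omega)
p00, p10, p11 = inverse_closed_form(t, g, omega)

# P_t times its claimed inverse should be the identity (row-major order).
product = (a * p00 + b * p10, a * p10 + b * p11,
           b * p00 + c * p10, b * p10 + c * p11)

# Central differences for dP/dt; with A = [[0, 1], [0, 0]] the Lyapunov
# equation reads da/dt = 2b, db/dt = c, dc/dt = -g^2.
ap, bp, cp = lyapunov_solution(t + h, g, omega)
am, bm, cm = lyapunov_solution(t - h, g, omega)
residuals = ((ap - am) / (2 * h) - 2 * b,
             (bp - bm) / (2 * h) - c,
             (cp - cm) / (2 * h) + g**2)

# omega -> 0 limit of P11, to be compared with -4 / (g^2 (t - 1)).
p11_limit = inverse_closed_form(t, g, 0.0)[2]
```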

Formula formula_51: $dm_t = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} m_t\,dt$ is $$\Phi(t, s) = \begin{bmatrix} 1 & t-s \\ 0 & 1 \end{bmatrix}$$

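Formula formula_52: $$u_t^* = \begin{bmatrix} 0 \\ a_t \end{bmatrix} \in \arg\min_{u_t \in \mathcal{U}} \int_0^T \frac{1}{2}\|u_t\|^2\,dt + x_1^\top R\,x_1 \quad \text{s.t.} \quad dm_t = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} m_t\,dt + u_t\,dt + g_t\,dw_t$$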

Formula formula_53: $$dm_t = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} m_t\,dt + u_t\,dt + g\,dw_t \tag{22}$$ $$m_0 = \mathbf{m}_0, \quad m_1 = \mathbf{m}_1 \tag{23}$$

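Formula formula_54: $$\begin{aligned}
dm_t &= \begin{bmatrix} v_t \\ F_t \end{bmatrix} dt + g_t\,dw_t = \begin{bmatrix} v_t \\ g_t^2 P_{11}\left(\dfrac{x_1 - x_t}{1-t} - v_t\right) \end{bmatrix} dt + g_t\,dw_t \\
&= \underbrace{\begin{bmatrix} 0 & 1 \\ -\dfrac{g_t^2 P_{11}}{1-t} & -g_t^2 P_{11} \end{bmatrix}}_{\mathbf{F}_t} \begin{bmatrix} x_t \\ v_t \end{bmatrix} dt + \underbrace{\begin{bmatrix} 0 \\ \dfrac{g_t^2 P_{11}}{1-t}\,x_1 \end{bmatrix}}_{\mathbf{D}_t} dt + g_t\,dw_t
\end{aligned}$$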

Formula formula_55: $$d\mu_t = \mathbf{F}_t\,\mu_t\,dt + \mathbf{D}_t\,dt, \qquad d\Sigma_t = \mathbf{F}_t\,\Sigma_t\,dt + \left[\mathbf{F}_t\,\Sigma_t\right]^\top dt + gg^\top dt$$

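Formula formula_56: $$\begin{aligned}
\mu_t^x &= \frac{1}{3}\,x_1 t^2 (t^2 - 4t + 6) \\
\mu_t^v &= \frac{4t\,x_1}{3}\,(t^2 - 3t + 3) \\
\Sigma_t^{xx} &= -\frac{1}{9}(-1+t)^2 \Big[-9 + 2(-1+k)\,t\,\big(3 + (-3+t)t\big)\big(3 + t\,[3 + (-3+t)t]\big)\Big] \\
\Sigma_t^{xv} &= \frac{1}{9}\Big\{(-1+t)\Big[t\,\big(3 + (-3+t)t\big)\big(9 + 8t\,(3 + (-3+t)t)\big) + k\,\Big(9 - t\,\big(3 + (-3+t)t\big)\big(9 + 8t\,(3 + (-3+t)t)\big)\Big)\Big]\Big\} \\
\Sigma_t^{vv} &= 1 - \frac{8}{9}(-1+k)\,t\,\big[3 + (-3+t)t\big]\big\{-3 + 4t\,(3 + (-3+t)t)\big\}
\end{aligned}$$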
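
The closed-form means can be checked against the linear mean dynamics dμ_t = F_t μ_t dt + D_t dt. The sketch below is a hedged illustration: the helper names and test values are hypothetical, scalar x_1 is assumed, and g_t² P_11 is taken as 4 / (1 - t), which follows from P_11 = -4 / (g_t² (t - 1)). It compares central-difference derivatives of μ^x_t and μ^v_t with the right-hand side of the mean ODE.

```python
def mu_x(t, x1):
    return x1 * t**2 * (t**2 - 4.0 * t + 6.0) / 3.0


def mu_v(t, x1):
    return 4.0 * t * x1 * (t**2 - 3.0 * t + 3.0) / 3.0


def mean_drift(t, x1):
    """Right-hand side F_t mu_t + D_t, using g_t^2 P_11 = 4 / (1 - t)."""
    k = 4.0 / (1.0 - t)
    mx, mv = mu_x(t, x1), mu_v(t, x1)
    return mv, -k / (1.0 - t) * mx - k * mv + k / (1.0 - t) * x1


def max_residual(x1, times, h=1e-6):
    """Largest mismatch between finite-difference d(mu)/dt and the drift."""
    worst = 0.0
    for t in times:
        dmx = (mu_x(t + h, x1) - mu_x(t - h, x1)) / (2 * h)
        dmv = (mu_v(t + h, x1) - mu_v(t - h, x1)) / (2 * h)
        rx, rv = mean_drift(t, x1)
        worst = max(worst, abs(dmx - rx), abs(dmv - rv))
    return worst


# Hypothetical data point and time grid, chosen only for illustration.
residual = max_residual(x1=2.0, times=[0.1, 0.3, 0.5, 0.7, 0.9])
```

The same functions also show the boundary behaviour of the mean: μ^x_0 = 0 and μ^x_1 = x_1.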

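Formula formula_57: $$\begin{bmatrix} dx_t \\ dv_t \end{bmatrix} = \begin{bmatrix} v_t \\ F_t \end{bmatrix} dt + \begin{bmatrix} 0 & 0 \\ 0 & g_t \end{bmatrix} dw_t \quad \text{s.t.} \quad m_0 := \begin{bmatrix} x_0 \\ v_0 \end{bmatrix} \sim \mathcal{N}(\mu_0, \Sigma_0) \tag{24}$$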

Formula formula_58: $$dm_t = f(m_t)\,dt + g_t\,dw_t$$

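Formula formula_59: $$\frac{\partial p_t}{\partial t} = -\sum_{i=1}^{d} \frac{\partial}{\partial m_i}\big[f_i(m, t)\,p_t(m_t)\big] + \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} \frac{\partial^2}{\partial m_i\,\partial m_j}\Big[\big(g_t g_t^\top\big)_{ij}\,p_t(m_t)\Big] \tag{25}$$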

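Formula formula_60: $$\frac{\partial p_t}{\partial t} = -\sum_{i=1}^{d} \frac{\partial}{\partial m_i}\left[f_i(m_t, t)\,p_t(m_t) - \frac{1}{2}\Big[p(m_t)\,\nabla_m \cdot \big(g_t g_t^\top\big) + p(m_t)\,g_t g_t^\top \nabla_m \log p(m_t)\Big]\right] \tag{26}$$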

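Formula formula_61: $$g_t \equiv \begin{bmatrix} 0 & 0 \\ 0 & g_t \end{bmatrix} \tag{27}$$ $$\frac{\partial p_t}{\partial t} = -\sum_{i=1}^{d} \frac{\partial}{\partial m_i}\left[f_i(m_t, t)\,p_t(m_t) - \frac{1}{2}\,p(m_t)\,g_t^2\,\nabla_v \log p(m_t)\right] \tag{28}$$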

Formula formula_62: $$dm_t = \left[f(m_t, t) - \frac{1}{2}\,g_t^2\,\nabla_v \log p(m, t)\right] dt \tag{29}$$

Formula formula_63: $$\Sigma_t = L_t L_t^\top \tag{30}$$

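Formula formula_64: $$L_t = \begin{bmatrix} L_t^{xx} & 0 \\ L_t^{xv} & L_t^{vv} \end{bmatrix} = \begin{bmatrix} \sqrt{\Sigma_t^{xx}} & 0 \\ \dfrac{\Sigma_t^{xv}}{\sqrt{\Sigma_t^{xx}}} & \sqrt{\dfrac{\Sigma_t^{xx}\Sigma_t^{vv} - (\Sigma_t^{xv})^2}{\Sigma_t^{xx}}} \end{bmatrix} \tag{31}$$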

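Formula formula_65: $$\begin{aligned}
\nabla_m \log p(m_t \mid m_1) &= -\nabla_{m_t}\left[\frac{1}{2}(m_t - \mu_t)^\top \Sigma_t^{-1}(m_t - \mu_t)\right] \\
&= -\Sigma_t^{-1}(m_t - \mu_t) \\
&= -L^{-\top} L^{-1}(m_t - \mu_t) \qquad \text{(Cholesky decomposition of } \Sigma_t\text{)} \\
&= -L^{-\top}\epsilon
\end{aligned}$$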

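Formula formula_66: $$L_t = \begin{bmatrix} \sqrt{\Sigma_t^{xx}} & 0 \\ \dfrac{\Sigma_t^{xv}}{\sqrt{\Sigma_t^{xx}}} & \sqrt{\dfrac{\Sigma_t^{xx}\Sigma_t^{vv} - (\Sigma_t^{xv})^2}{\Sigma_t^{xx}}} \end{bmatrix}$$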

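Formula formula_67: $$L_t^{-\top} = \begin{bmatrix} \dfrac{1}{\sqrt{\Sigma_t^{xx} + \epsilon_{xx}}} & \dfrac{-\Sigma_t^{xv}}{\sqrt{\Sigma_t^{xx}}\,\sqrt{\Sigma_t^{xx}\left(\Sigma_t^{vv} + \epsilon_{vv}\right) - (\Sigma_t^{xv})^2}} \\ 0 & \dfrac{\sqrt{\Sigma_t^{xx}}}{\sqrt{\Sigma_t^{xx}\Sigma_t^{vv} - (\Sigma_t^{xv})^2}} \end{bmatrix}$$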
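
The link between the Cholesky factor and the conditional score can be illustrated for a single 2x2 covariance. This is a minimal sketch under stated assumptions: the function names and numeric values are hypothetical, the factorization is taken as Σ_t = L_t L_t^T with L_t lower triangular, and the small stabilizers ε_xx, ε_vv are omitted (set to zero). It checks that -L^{-T} ε agrees with a direct evaluation of -Σ_t^{-1}(m_t - μ_t), and that the velocity component is -ε_1 / L^{vv}_t.

```python
import math


def cholesky_2x2(sxx, sxv, svv):
    """Lower-triangular factor entries (L^xx, L^xv, L^vv) of Sigma."""
    lxx = math.sqrt(sxx)
    lxv = sxv / lxx
    lvv = math.sqrt((sxx * svv - sxv**2) / sxx)
    return lxx, lxv, lvv


def score_from_noise(eps0, eps1, lxx, lxv, lvv):
    """-L^{-T} eps with L^{-T} = [[1/lxx, -lxv/(lxx*lvv)], [0, 1/lvv]]."""
    return (-(eps0 / lxx - lxv * eps1 / (lxx * lvv)), -eps1 / lvv)


# Hypothetical covariance and noise draw, chosen only for illustration.
sxx, sxv, svv = 2.0, 0.5, 1.0
eps0, eps1 = 0.3, -1.2
lxx, lxv, lvv = cholesky_2x2(sxx, sxv, svv)

# Rebuild Sigma = L L^T as (xx, xv, vv).
sigma_rebuilt = (lxx * lxx, lxx * lxv, lxv * lxv + lvv * lvv)

# Direct evaluation of -Sigma^{-1} (m_t - mu_t) with m_t - mu_t = L eps.
dx, dv = lxx * eps0, lxv * eps0 + lvv * eps1
det = sxx * svv - sxv**2
score_direct = (-(svv * dx - sxv * dv) / det, -(-sxv * dx + sxx * dv) / det)
score_noise = score_from_noise(eps0, eps1, lxx, lxv, lvv)
```

Working with ε instead of inverting Σ_t avoids forming Σ_t^{-1} explicitly, which is presumably why the score is written in terms of L^{-T}.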

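Formula formula_68: $$\nabla_v \log p(m_t \mid m_1) = -\underbrace{\sqrt{\frac{\Sigma_t^{xx}}{\left(\Sigma_t^{xx} + \epsilon_{xx}\right)\left(\Sigma_t^{vv} + \epsilon_{vv}\right) - (\Sigma_t^{xv})^2}}}_{\ell_t}\;\epsilon_1$$

D.6 REPRESENTATION OF ACCELERATION $a_t$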

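Formula formula_69: $$\begin{aligned}
a_t^* &= g_t^2 P_{11}\left(\frac{x_1 - x_t}{1-t} - v_t\right) \\
&= g_t^2 P_{11}\,\frac{x_1}{1-t} - g_t^2 P_{11}\left(\frac{x_t}{1-t} + v_t\right) \\
&= g_t^2 P_{11}\,\frac{x_1}{1-t} - g_t^2 P_{11}\left(\frac{\mu_t^x + L_t^{xx}\epsilon_0}{1-t} + \left(\mu_t^v + L_t^{xv}\epsilon_0 + L_t^{vv}\epsilon_1\right)\right) \\
&= g_t^2 P_{11}\left[\frac{x_1 - \mu_t^x}{1-t} - \mu_t^v - \left(\frac{L_t^{xx}}{1-t}\,\epsilon_0 + L_t^{xv}\epsilon_0 + L_t^{vv}\epsilon_1\right)\right]
\end{aligned}$$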

Formula formula_70: With $\mu_t^x = \frac{1}{3}x_1 t^2(t^2 - 4t + 6)$ and $\mu_t^v = \frac{4t\,x_1}{3}(t^2 - 3t + 3)$, and writing $N_t := \frac{L_t^{xx}}{1-t}\epsilon_0 + L_t^{xv}\epsilon_0 + L_t^{vv}\epsilon_1$ for the noise term, plugging in $x_t, v_t$ gives $$\begin{aligned}
a_t^* &= g_t^2 P_{11}\left[\frac{x_1 - \frac{1}{3}x_1 t^2(6 - 4t + t^2)}{1-t} - \frac{4t\,x_1}{3}(t^2 - 3t + 3) - N_t\right] \\
&= g_t^2 P_{11}\left[\left(\frac{-t^4 + 4t^3 - 6t^2 + 3}{3(1-t)} - \frac{4t}{3}(t^2 - 3t + 3)\right)x_1 - N_t\right] \\
&= g_t^2 P_{11}\left[\left(\frac{-(t-1)(t^3 - 3t^2 + 3t + 3)}{3(1-t)} - \frac{4t}{3}(t^2 - 3t + 3)\right)x_1 - N_t\right] \\
&= g_t^2 P_{11}\left[\left(\frac{t^3 - 3t^2 + 3t + 3}{3} - \frac{1}{3}(4t^3 - 12t^2 + 12t)\right)x_1 - N_t\right] \\
&= g_t^2 P_{11}\left[(1-t)^3\,x_1 - N_t\right] \\
&= 4(1-t)^2\,x_1 - g_t^2 P_{11}\left[\frac{L_t^{xx}}{1-t}\,\epsilon_0 + L_t^{xv}\epsilon_0 + L_t^{vv}\epsilon_1\right]
\end{aligned}$$

D.7 LOSS REWEIGHT

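Formula formula_71: $$\begin{aligned}
\mathcal{L} &= \min_\theta\; \mathbb{E}_{t \in [0,1]}\,\mathbb{E}_{x_1 \sim p_{\text{data}}}\,\mathbb{E}_{m_t \sim p_t(m_t \mid x_1)}\; \lambda(t)\,\big\|F_t^\theta(m_t, t; \theta) - F_t(m_t, t)\big\|_2^2 && (32) \\
&\propto \min_\theta\; \mathbb{E}_{t \in [0,1]}\,\mathbb{E}_{x_1 \sim p_{\text{data}}}\,\mathbb{E}_{m_t \sim p_t(m_t \mid x_1)}\; \frac{1}{1-t}\,\big\|F_t^\theta(m_t, t; \theta) - F_t(m_t, t)/z_t\big\|_2^2 && (33)
\end{aligned}$$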

Formula formula_72: $$a^*(m_t, t) = 4x_1(1-t)^2 - g_t^2 P_{11}\left[\left(\frac{L_t^{xx}}{1-t} + L_t^{xv}\right)\epsilon_0 + L_t^{vv}\epsilon_1\right].$$
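
The equivalence between the bridge form of the acceleration and its data-plus-noise form can be verified numerically. The sketch below is illustrative only: the function names and numeric values are hypothetical, scalar x_1 is assumed, and g_t² P_11 is taken as 4 / (1 - t). Since the identity relies only on the closed-form means, arbitrary values for L^{xx}_t, L^{xv}_t, L^{vv}_t suffice.

```python
def mu_x(t, x1):
    return x1 * t**2 * (t**2 - 4.0 * t + 6.0) / 3.0


def mu_v(t, x1):
    return 4.0 * t * x1 * (t**2 - 3.0 * t + 3.0) / 3.0


def accel_bridge(x1, x_t, v_t, t):
    """a* = g_t^2 P_11 ((x1 - x_t)/(1 - t) - v_t) with g_t^2 P_11 = 4/(1 - t)."""
    k = 4.0 / (1.0 - t)
    return k * ((x1 - x_t) / (1.0 - t) - v_t)


def accel_noise_form(x1, eps0, eps1, t, lxx, lxv, lvv):
    """a* = 4 x1 (1 - t)^2 - g_t^2 P_11 [(L^xx/(1 - t) + L^xv) eps0 + L^vv eps1]."""
    k = 4.0 / (1.0 - t)
    return 4.0 * x1 * (1.0 - t) ** 2 - k * ((lxx / (1.0 - t) + lxv) * eps0 + lvv * eps1)


# Hypothetical values, chosen only for illustration.
x1, t = 1.7, 0.35
lxx, lxv, lvv = 0.8, 0.3, 0.6
eps0, eps1 = 0.4, -0.9

# Reparameterized sample m_t = mu_t + L eps.
x_t = mu_x(t, x1) + lxx * eps0
v_t = mu_v(t, x1) + lxv * eps0 + lvv * eps1

lhs = accel_bridge(x1, x_t, v_t, t)
rhs = accel_noise_form(x1, eps0, eps1, t, lxx, lxv, lvv)
```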

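Formula formula_73: $$\begin{aligned}
z_{\text{SDE}} &= \sqrt{\left(4(1-t)^2 \cdot \sigma_{\text{data}}\right)^2 + \left[g_t^2 P_{11}\left(\frac{L_t^{xx}}{1-t} + L_t^{xv}\right)\right]^2 + \left(g_t^2 P_{11}\,L_t^{vv}\right)^2} \\
z_{\text{ODE}} &= \sqrt{\left(4(1-t)^2 \cdot \sigma_{\text{data}}\right)^2 + \left[g_t^2 P_{11}\left(\frac{L_t^{xx}}{1-t} + L_t^{xv}\right)\right]^2 + \left[g_t^2 P_{11}\,L_t^{vv} - \frac{1}{2}\,g_t^2\,\ell_t\right]^2}
\end{aligned}$$ where $$\ell_t := \sqrt{\frac{\Sigma_t^{xx}}{\Sigma_t^{xx}\Sigma_t^{vv} - (\Sigma_t^{xv})^2}}$$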

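Formula formula_74: $$\begin{bmatrix} x_{t_{i+1}} \\ v_{t_{i+1}} \end{bmatrix} = \Phi(t_{i+1}, t_i) \begin{bmatrix} x_{t_i} \\ v_{t_i} \end{bmatrix} + \sum_{j=0}^{r} C_{i,j} \begin{bmatrix} 0 \\ s_\theta(m_{t_{i-j}}, t_{i-j}) \end{bmatrix}$$ where $$C_{i,j} = \int_{t_i}^{t_i + \delta_t} \Phi(t_i + \delta_t, \tau) \begin{bmatrix} 0 & 0 \\ 0 & z_\tau \end{bmatrix} \prod_{k \neq j} \frac{\tau - t_{i-k}}{t_{i-j} - t_{i-k}}\,d\tau, \qquad \Phi(t, s) = \begin{bmatrix} 1 & t-s \\ 0 & 1 \end{bmatrix} \tag{34}$$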

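Formula formula_75: $$x_1^{\text{SDE}} = (1-t)\left(\frac{F_t^\theta}{g_t^2 P_{11}} + v_t\right) + x_t, \quad \text{or} \quad x_1^{\text{ODE}} = \frac{F_t^\theta + g_t^2 P_{11}\left(\alpha_t x_t + \beta_t v_t\right)}{4(t-1)^2 + g_t^2 P_{11}\left(\alpha_t \mu_t^x + \beta_t \mu_t^v\right)} \tag{35}$$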

Formula formula_76: $$\beta_t = L_t^{vv} + \frac{1}{2P_{11}}, \qquad \alpha_t = \frac{\left(\dfrac{L_t^{xx}}{1-t} + L_t^{xv}\right) - \beta_t L_t^{xv}}{L_t^{xx}}.$$

Formula formula_77: $$F_t^\theta \approx g_t^2 P_{11}\left(\frac{x_1 - x_t}{1-t} - v_t\right) \;\Leftrightarrow\; x_1 \approx (1-t)\left(\frac{F_t^\theta}{g_t^2 P_{11}} + v_t\right) + x_t$$
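
The early data prediction amounts to solving the force relation for x_1. The sketch below is a hedged round-trip check rather than the authors' implementation: the function names and values are hypothetical, the exact force stands in for the network output F^θ_t, and g_t² P_11 is taken as 4 / (1 - t). Solving F = g_t² P_11 ((x_1 - x_t) / (1 - t) - v_t) for x_1 gives x_1 = (1 - t)(F / (g_t² P_11) + v_t) + x_t.

```python
def bridge_force(x1, x_t, v_t, t):
    """F = g_t^2 P_11 ((x1 - x_t)/(1 - t) - v_t) with g_t^2 P_11 = 4/(1 - t)."""
    k = 4.0 / (1.0 - t)
    return k * ((x1 - x_t) / (1.0 - t) - v_t)


def predict_x1(force, x_t, v_t, t):
    """Invert the force relation to estimate the data point x1."""
    k = 4.0 / (1.0 - t)
    return (1.0 - t) * (force / k + v_t) + x_t


# Hypothetical state early in the trajectory, chosen only for illustration.
x1_true, x_t, v_t, t = 0.9, -0.4, 1.1, 0.2

force = bridge_force(x1_true, x_t, v_t, t)  # stands in for F_theta
x1_estimate = predict_x1(force, x_t, v_t, t)
```

With a learned force the recovered value is only an estimate of x_1, but the inversion itself is exact, which is what makes a data prediction available at any time t < 1.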

Formula formula_78: $$m_t = \mu_t + L\epsilon \;\Leftrightarrow\; x_t = \mu_t^x + L_t^{xx}\epsilon_0, \quad v_t = \mu_t^v + L_t^{xv}\epsilon_0 + L_t^{vv}\epsilon_1$$

Formula formula_79: $$F(m_t, t) = 4x_1(1-t)^2 -$$

Formula formula_80: $$t_i = \left(\frac{N-i}{N}\,t_0^{1/\kappa} + \frac{i}{N}\,t_N^{1/\kappa}\right)^{\kappa}$$
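
The time discretization can be computed directly from this formula. The sketch below is a minimal illustration; the function name `time_grid`, the argument names `t_first` and `t_last` (standing for t_0 and t_N), and the numeric values are assumptions made for the example. With κ = 1 the grid is uniform, while larger κ places more steps near t_0.

```python
def time_grid(n_steps, t_first, t_last, kappa):
    """Return [t_0, ..., t_N] with t_i = ((N-i)/N t_0^(1/k) + i/N t_N^(1/k))^k."""
    inv = 1.0 / kappa
    return [
        (((n_steps - i) / n_steps) * t_first**inv
         + (i / n_steps) * t_last**inv) ** kappa
        for i in range(n_steps + 1)
    ]


# Hypothetical settings, chosen only for illustration.
grid = time_grid(n_steps=5, t_first=0.001, t_last=0.999, kappa=2.0)
uniform = time_grid(n_steps=4, t_first=0.0, t_last=1.0, kappa=1.0)
```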
