Title: GLOBAL CONVERGENCE IN NEURAL ODES: IMPACT OF ACTIVATION FUNCTIONS

Abstract: Neural Ordinary Differential Equations (ODEs) have been successful in various applications due to their continuous nature and parameter-sharing efficiency. However, these unique characteristics also introduce challenges in training, particularly with respect to gradient computation accuracy and convergence analysis. In this paper, we address these challenges by investigating the impact of activation functions. We demonstrate that the properties of activation functions-specifically smoothness and nonlinearity-are critical to the training dynamics. Smooth activation functions guarantee globally unique solutions for both forward and backward ODEs, while sufficient nonlinearity is essential for maintaining the spectral properties of the Neural Tangent Kernel (NTK) during training. Together, these properties enable us to establish the global convergence of Neural ODEs under gradient descent in overparameterized regimes. Our theoretical findings are validated by numerical experiments, which not only support our analysis but also provide practical guidelines for scaling Neural ODEs, potentially leading to faster training and improved performance in real-world applications.

Section: INTRODUCTION
Deep neural networks have achieved remarkable success across diverse applications. Among them, Neural Ordinary Differential Equations (Neural ODEs) (Chen et al., 2018b) stand out for their continuous nature and parameter efficiency. Unlike traditional neural networks, which are characterized by discrete layers, Neural ODEs model the evolution of hidden states as a continuous-time differential equation. This allows a more faithful representation of dynamical systems and, through parameter sharing, offers advantages in memory efficiency and computational complexity. These attributes make Neural ODEs effective not only for conventional machine learning tasks such as image classification (Chen et al., 2018b) and natural language processing (Rubanova et al., 2019), but also for applications involving continuous processes, including time series analysis (Kidger et al., 2020), reinforcement learning (Du et al., 2020), and diffusion models (Song et al., 2020). Despite these advantages, training Neural ODEs raises significant theoretical and practical challenges, particularly in computing gradients accurately and in analyzing convergence.
A primary challenge in the training of Neural ODEs stems from the accurate computation of gradients. In contrast to conventional neural networks, where gradients are efficiently propagated through a discrete chain of layers via backpropagation, Neural ODEs necessitate the numerical solution of both forward and backward (adjoint) ODEs. This reliance on numerical solvers inevitably introduces inherent numerical errors, which can significantly compromise gradient accuracy, leading to sluggish convergence, or even suboptimal model performance (Rodriguez et al., 2022). Furthermore, ensuring the well-posedness of ODE solutions throughout the training trajectory is a non-trivial task. As stipulated by the Picard-Lindelöf Theorem, solutions to ODEs may not always exist globally or may only be locally unique, a condition that can precipitate training divergence or substantial numerical inaccuracies (Gholami et al., 2019;Ott et al., 2020;Sander et al., 2022). Even with the advent of sophisticated numerical solvers (Zhuang et al., 2020a;b;Matsubara et al., 2021;Ko et al., 2023), it remains an open and critical question whether fundamental first-order optimization methods, such as stochastic gradient descent (SGD), can reliably achieve convergence when applied to Neural ODEs. While the strategy of discretizing Neural ODEs into finite-depth networks offers a potential avenue for gradient computation, this approach typically results in an even deeper computational graph (Zhuang et al., 2020a;b), thereby raising fundamental questions regarding the fidelity of these discretized gradients to those of the true continuous model.
A second, equally critical, challenge resides in rigorously analyzing the training dynamics of Neural ODEs. The optimization landscape for training deep neural networks is notoriously non-convex, which fundamentally complicates theoretical analysis. A significant theoretical advancement was made by Jacot et al. (2018), who demonstrated that the intricate training dynamics of overparameterized networks can be effectively characterized through the framework of the Neural Tangent Kernel (NTK). The NTK is shown to converge to a deterministic limit as the network width approaches infinity. This seminal convergence result has empowered researchers to establish robust global convergence guarantees for gradient-based optimization methods in overparameterized regimes, contingent upon the NTK maintaining its strictly positive definite (SPD) property throughout training (Du et al., 2019a;Allen-Zhu et al., 2019;Nguyen, 2021;Gao et al., 2021). The foundational analysis of the NTK's strict positive definiteness was pioneered by Daniely et al. (2016) with the introduction of dual activation for two-layer networks, subsequently extended to deeper, finite-layered architectures (Jacot et al., 2018;Du et al., 2019a). Crucially, these profound theoretical insights are predominantly confined to networks with discrete layers, leaving a significant void: it remains an open and unresolved question whether these essential properties, particularly the SPD of the NTK, extend to continuous-depth models such as Neural ODEs.
In this paper, we address these challenges by investigating the impact of activation functions on the training dynamics of Neural ODEs. Our analysis shows that two properties of activation functions, smoothness and nonlinearity, govern both the well-posedness of ODE solutions and the spectral properties of the Neural Tangent Kernel (NTK). We establish that smooth activation functions guarantee globally unique solutions for both the forward and backward (adjoint) ODEs, which ensures the stability and reliability of the training process. We also extend existing results on the NTK, previously confined to discrete-layered neural networks, to continuous models, proving that the NTK of Neural ODEs is well-defined. Furthermore, our findings highlight that a higher degree of nonlinearity in the activation function is not only essential for preserving the strictly positive definite (SPD) property of the limiting NTK, but also empirically leads to faster convergence of Neural ODEs.
Our key contributions are summarized as follows:
1.  **Well-Posedness and Gradient Equivalence:** We rigorously establish that the smoothness of activation functions is crucial for the well-posedness of both forward and backward ODEs in Neural ODEs, guaranteeing globally unique solutions. We further demonstrate that, under sufficient smoothness, the "optimize-then-discretize" and "discretize-then-optimize" gradient computation methods yield equivalent gradients, a critical insight for stable training.
2.  **NTK for Continuous Models:** We introduce a novel mathematical framework that extends the Neural Tangent Kernel (NTK) theory from discrete-layered networks to continuous-depth Neural ODEs. By approximating Neural ODEs with sequences of finite-depth networks and leveraging the Moore-Osgood theorem, we prove that the NTK of Neural ODEs is rigorously well-defined in the infinite-width limit.
3.  **Strict Positive Definiteness (SPD) of NTK:** We provide a fine-grained analysis of the limiting NTK for Neural ODEs, deriving its integral form. This integral representation reveals that the NTK retains its strictly positive definite (SPD) property if the activation function is non-polynomial, a crucial condition for global convergence. This result addresses a long-standing challenge regarding NTK properties at infinite depth.
4.  **Global Convergence Guarantees:** Building upon the established well-posedness and SPD properties of the NTK, we provide, for the first time, global convergence guarantees for gradient descent in overparameterized Neural ODEs. This theoretical breakthrough bridges the gap between discrete and continuous-depth models in the context of global optimization.
5.  **Empirical Validation and Practical Guidelines:** We conduct extensive numerical experiments that not only validate our theoretical findings but also provide practical insights. Our experiments demonstrate that smooth and nonlinear activation functions accelerate convergence and improve performance. Furthermore, we offer guidelines on proper ODE scaling to prevent damping and highlight the efficiency challenges of adaptive solvers in large-scale Neural ODEs.

Section: PRELIMINARIES


Section: NEURAL ODES
In this paper, we analyze a canonical form of Neural ODE, denoted by $f(x; \theta)$ and defined as
$$f(x; \theta) = \frac{\sigma_v}{\sqrt{n}}\, v^\top \phi(h_T), \tag{1}$$
where $h_t \in \mathbb{R}^n$ is the hidden state, whose evolution is governed by the ordinary differential equation
$$h_0 = \frac{\sigma_u}{\sqrt{d}}\, U x, \qquad \dot{h}_t = \frac{\sigma_w}{\sqrt{n}}\, W \phi(h_t), \quad \forall t \in [0, T]. \tag{2}$$
Here, $\phi$ is the activation function, $x \in \mathbb{R}^d$ is the input vector, and $U \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{n \times n}$, and $v \in \mathbb{R}^n$ are the learnable parameters. These parameters, collectively denoted by $\theta := \mathrm{vec}(U, W, v)$, are initialized randomly (Glorot & Bengio, 2010; He et al., 2015) from a standard Gaussian distribution:
$$U_{ij},\ W_{ij},\ v_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1), \tag{3}$$
with associated variance hyperparameters $\sigma_u, \sigma_w, \sigma_v > 0$. Because Neural ODEs are continuous models, standard backpropagation cannot be applied directly to compute gradients. Instead, we employ the adjoint method (Chen et al., 2018b), which solves a backward (adjoint) ODE to compute the gradients. The adjoint state $\lambda_t$ satisfies
$$\lambda_T = \frac{\sigma_v}{\sqrt{n}}\, \mathrm{diag}(\phi'(h_T))\, v, \qquad \dot{\lambda}_t = -\frac{\sigma_w}{\sqrt{n}}\, \mathrm{diag}(\phi'(h_t))\, W^\top \lambda_t. \tag{4}$$
Provided that both the forward and backward ODEs are well-posed, the gradients of the model output $f(x; \theta)$ with respect to the parameters $\theta$ are given by
$$\nabla_v f(x; \theta) = \frac{\sigma_v}{\sqrt{n}}\, \phi(h_T), \qquad \nabla_W f(x; \theta) = \int_0^T \frac{\sigma_w}{\sqrt{n}}\, \lambda_t\, \phi(h_t)^\top\, dt, \qquad \nabla_U f(x; \theta) = \frac{\sigma_u}{\sqrt{d}}\, \lambda_0\, x^\top. \tag{5}$$
Further comprehensive details on these derivations are elucidated in Appendix B. A key theoretical result, established in Section 3, demonstrates that if the activation function ϕ is Lipschitz continuous, both the forward and backward ODEs possess globally unique solutions. This crucial well-posedness property is further proven in Section 5 to hold consistently throughout the entire training process.
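To make Eqs. (4) and (5) concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes a tanh activation, sets $\sigma_u = \sigma_w = \sigma_v = 1$, and replaces the ODE solver by fixed-step Euler updates, so that the backward recursion is an Euler discretization of the adjoint ODE and coincides with backpropagation through the Euler forward pass. The function names `forward` and `adjoint_gradients` are hypothetical. One entry of the resulting gradient is compared against a central finite difference.

```python
import numpy as np

def dphi(h):
    """Derivative of tanh (assumed activation)."""
    return 1.0 - np.tanh(h) ** 2

def forward(U, W, v, x, T, L):
    """Euler forward pass with L steps; returns the output f and states h_0..h_L."""
    n, d = U.shape
    kappa = T / L
    hs = [U @ x / np.sqrt(d)]                      # h_0 (sigma_u = 1)
    for _ in range(L):
        hs.append(hs[-1] + kappa * (W @ np.tanh(hs[-1])) / np.sqrt(n))
    f = v @ np.tanh(hs[-1]) / np.sqrt(n)           # Eq. (1) with sigma_v = 1
    return f, hs

def adjoint_gradients(U, W, v, x, T, L):
    """Discretized versions of Eqs. (4)-(5), integrating the adjoint backward in time."""
    n, d = U.shape
    kappa = T / L
    f, hs = forward(U, W, v, x, T, L)
    grad_v = np.tanh(hs[-1]) / np.sqrt(n)
    lam = dphi(hs[-1]) * v / np.sqrt(n)            # terminal condition lambda_T
    grad_W = np.zeros_like(W)
    for ell in range(L, 0, -1):
        # Riemann-sum contribution to the integral in Eq. (5)
        grad_W += kappa * np.outer(lam, np.tanh(hs[ell - 1])) / np.sqrt(n)
        # one backward Euler-type step of the adjoint ODE in Eq. (4)
        lam = lam + kappa * dphi(hs[ell - 1]) * (W.T @ lam) / np.sqrt(n)
    grad_U = np.outer(lam, x) / np.sqrt(d)         # lam now approximates lambda_0
    return f, grad_U, grad_W, grad_v

rng = np.random.default_rng(0)
n, d, T, L = 16, 4, 1.0, 50
U = rng.standard_normal((n, d))
W = rng.standard_normal((n, n))
v = rng.standard_normal(n)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

f, gU, gW, gv = adjoint_gradients(U, W, v, x, T, L)

# Central finite-difference check on one entry of W and one entry of U.
eps = 1e-6
EW = np.zeros_like(W); EW[2, 5] = eps
fd_W = (forward(U, W + EW, v, x, T, L)[0] - forward(U, W - EW, v, x, T, L)[0]) / (2 * eps)
EU = np.zeros_like(U); EU[3, 1] = eps
fd_U = (forward(U + EU, W, v, x, T, L)[0] - forward(U - EU, W, v, x, T, L)[0]) / (2 * eps)
print("dW[2,5]:", gW[2, 5], "finite difference:", fd_W)
print("dU[3,1]:", gU[3, 1], "finite difference:", fd_U)
```

The stored list `hs` is what makes this sketch memory-hungry for large `L`; it corresponds to storing $h_t$ along the trajectory rather than re-solving the forward ODE backward in time.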

Section: NEURAL TANGENT KERNEL
Given a training dataset $\{(x_i, y_i)\}_{i=1}^N$, the objective is to learn a parameter vector $\theta$ that minimizes the empirical loss function:
$$L(\theta) = \sum_{i=1}^N \frac{1}{2}\,(f(x_i; \theta) - y_i)^2 = \frac{1}{2}\,\|u - y\|^2, \tag{6}$$
where $u = (u_1, \ldots, u_N)$ is the prediction vector with $u_i = f(x_i; \theta)$, and $y = (y_1, \ldots, y_N)$ is the corresponding output vector. The loss is typically minimized by gradient descent with a learning rate $\eta > 0$:
$$\theta_{k+1} = \theta_k - \eta \nabla_\theta L(\theta_k). \tag{7}$$
As established by prior works (Du et al., 2019a; Jacot et al., 2018), under specific conditions, the evolution of the prediction vector $u_k$ during training is well approximated by
$$u_{k+1} - y \approx (I - \eta H_k)(u_k - y), \tag{8}$$
where $H_k \in \mathbb{R}^{N \times N}$ is a Gram matrix constructed from the Neural Tangent Kernel (NTK) (Jacot et al., 2018), defined as
$$K(x, \bar{x}; \theta) := \langle \nabla_\theta f(x; \theta), \nabla_\theta f(\bar{x}; \theta) \rangle. \tag{9}$$
The stability and convergence of the training dynamics described by Eq. (8) are governed by the spectral properties of the NTK Gram matrix $H_k$. Specifically, if there exists a constant $\lambda_0 > 0$ such that $\lambda_{\min}(H_k) \ge \lambda_0$ for all iterations $k$, then the residual $u_k - y$ decreases to zero at an exponential rate, provided the learning rate $\eta > 0$ is chosen sufficiently small (Allen-Zhu et al., 2019; Nguyen, 2021). A difficulty is that the parameters $\theta_k$ are updated at every iteration, so the NTK $K_\theta$ varies over time, which complicates its spectral analysis. Fortunately, prior work (Yang, 2020; Jacot et al., 2018) has shown that the time-varying NTK $K_\theta$ converges to a deterministic limiting NTK $K_\infty$ as the network width $n \to \infty$. This result, coupled with the concept of dual activation (Daniely et al., 2016), enables the study of the spectral properties of $K_\theta$ during training via $K_\infty$ using perturbation theory.
However, these profound theoretical results are inherently limited to finite-depth neural networks. Neural ODEs, by their very definition, are continuous-depth models, effectively representing infinite-depth networks. Consequently, the inductive techniques employed in prior analyses do not directly guarantee that these essential NTK properties, particularly the strict positive definiteness (SPD), will persist as depth tends to infinity. In this paper, we bridge this critical gap. In Section 4.1 and Section 4.2, we introduce a novel mathematical framework specifically designed to analyze Neural ODEs as infinite-depth networks. We rigorously demonstrate that the smoothness of the activation function ϕ is indispensable for preserving these essential NTK properties in Neural ODEs. Furthermore, to precisely characterize the spectral properties of the limiting NTK K ∞ , we conduct a fine-grained analysis, deriving an integral form for the limiting NTK of Neural ODEs. This integral representation provides a crucial insight: the NTK remains strictly positive definite (SPD) if and only if the activation function is non-polynomial. This finding underscores the critical role of activation function nonlinearity in ensuring the well-behaved spectral properties of K ∞ .
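Returning to the linearized dynamics in Eq. (8), the role of the smallest eigenvalue can be seen in a short worked step. The derivation below is only an illustration: it treats the approximation in Eq. (8) as an equality and assumes a step size $\eta \le 1/\lambda_{\max}(H_k)$, neither of which is stated in this form in the text above.

```latex
% Illustrative assumptions: Eq. (8) holds with equality, H_k is symmetric,
% \lambda_{\min}(H_k) \ge \lambda_0 > 0 and \eta \le 1/\lambda_{\max}(H_k) for all k.
\begin{aligned}
\|u_{k+1} - y\|
  &= \|(I - \eta H_k)(u_k - y)\|
   \le \|I - \eta H_k\|_2 \,\|u_k - y\| \\
  &= \bigl(1 - \eta\,\lambda_{\min}(H_k)\bigr)\,\|u_k - y\|
   \le (1 - \eta\lambda_0)\,\|u_k - y\|, \\
\Longrightarrow\quad
\|u_k - y\|
  &\le (1 - \eta\lambda_0)^k \,\|u_0 - y\|
   \le e^{-\eta\lambda_0 k}\,\|u_0 - y\|.
\end{aligned}
```

Under these assumptions the eigenvalues of $I - \eta H_k$ lie in $[0, 1 - \eta\lambda_0]$, which is why a uniform lower bound $\lambda_0$ on the spectrum is the quantity that the subsequent analysis needs to control.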

Section: WELL-POSEDNESS OF NEURAL ODES AND ITS GRADIENTS
As continuous models, Neural ODEs pose a significant challenge in accurately computing gradients. In this section, we explore the challenges associated with two methods for computing gradients: optimize-then-discretize and discretize-then-optimize (Gholami et al., 2019;Finlay et al., 2020;Onken et al., 2021). Through our exploration, we emphasize the essential role of smoothness in activation functions to guarantee the well-posedness of Neural ODEs and their gradients.

Section: OPTIMIZE-THEN-DISCRETIZE METHOD
As discussed in Section 2, Neural ODEs require numerical ODE solvers to solve the forward and backward ODEs Eq. ( 2) and Eq. ( 4) to compute the gradients, employing a method known as the optimize-then-discretize method. When solving ODEs, ensuring their well-posedness is of primary concern. In Proposition 1, we demonstrate that if ϕ is Lipschitz continuous, the forward and backward ODEs have globally unique solutions, thus ensuring the well-posedness. The detailed proofs are provided in Appendix C.
Proposition 1. For any given $T > 0$, if the activation function $\phi$ is Lipschitz continuous with Lipschitz constant $L_1$, then the forward ODE Eq. (2) and the backward ODE Eq. (4) have unique solutions $h_t$ and $\lambda_t$ for all $t \in [0, T]$ and $x \in \mathbb{R}^d$, almost surely over the random initialization Eq. (3).
In addition, $\lambda_t(x) = \partial f(x; \theta)/\partial h_t$ is the solution to the backward ODE.
Although Neural ODEs and their gradients are well-defined under these conditions, this does not guarantee that gradients can be computed accurately by solving the ODEs numerically. One primary issue is that the magnitudes of the hidden state h t and adjoint state λ t can grow exponentially over the time horizon T , leading to accumulated numerical errors. This issue is illustrated in Figure 3, where long-time horizon leads to damping in the early stages of training. To mitigate this problem, we suggest scaling the dynamics by setting σ w = O (1/T ), which ensures that the magnitudes of h t and λ t remain mild and independent of T . This scaling stabilizes the norms, thereby allowing numerical solvers to produce much more accurate gradient estimates.
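The effect of the suggested scaling can be sketched numerically. The snippet below is an illustration under stated assumptions, not the paper's experiment: it uses a Softplus activation, $\sigma_u = 1$, a fixed-step Euler integrator, and the specific choice $\sigma_w = 1/T$ (one instance of $\sigma_w = O(1/T)$). The helper name `hidden_rms_norm` is hypothetical. With $\sigma_w = 1/T$, the substitution $s = t/T$ maps the ODE on $[0, T]$ to the same ODE on $[0, 1]$, so the terminal state does not depend on $T$.

```python
import numpy as np

def softplus(h):
    """Numerically stable Softplus, log(1 + exp(h))."""
    return np.logaddexp(0.0, h)

def hidden_rms_norm(W, h0, T, sigma_w, num_steps=2000):
    """Euler-integrate dh/dt = sigma_w * W @ softplus(h) / sqrt(n) on [0, T]
    and return ||h_T|| / sqrt(n)."""
    n = h0.shape[0]
    kappa = T / num_steps
    h = h0.copy()
    for _ in range(num_steps):
        h = h + kappa * sigma_w * (W @ softplus(h)) / np.sqrt(n)
    return np.linalg.norm(h) / np.sqrt(n)

rng = np.random.default_rng(0)
n, d = 64, 8
W = rng.standard_normal((n, n))
U = rng.standard_normal((n, d))
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
h0 = U @ x / np.sqrt(d)

horizons = [1.0, 4.0, 16.0]
unscaled = [hidden_rms_norm(W, h0, T, sigma_w=1.0) for T in horizons]
scaled = [hidden_rms_norm(W, h0, T, sigma_w=1.0 / T) for T in horizons]

for T, a, b in zip(horizons, unscaled, scaled):
    print(f"T={T:5.1f}  sigma_w=1: {a:.3e}   sigma_w=1/T: {b:.3e}")
```

In the scaled runs the effective Euler step $\kappa \sigma_w = 1/\text{num\_steps}$ is the same for every horizon, so the printed norms in the last column agree across $T$, whereas the unscaled column shows how the magnitude depends on the horizon.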
Additionally, calculating gradients using Eq. ( 5) requires storing the values of h t and λ t at every time step t ∈ [0, T ], which can consume a significant amount of memory in practice. To address this, Chen et al. (2018b) propose solving an augmented backward ODE (defined in Appendix 36 for our setup), which combines an additional gradient state with both the backward ODE and the reversed forward ODE. This approach eliminates the need for storing intermediate states. However, since the hidden state h t is no longer constant in the augmented ODE, additional regularization conditions on the dynamics are typically required to ensure the stability of the solution. Fortunately, we demonstrate that such regularization conditions are unnecessary for Neural ODEs because the Lipschitz continuity of ϕ ensures that h t is well defined for all t ∈ [0, T ]. Therefore, the augmented ODE approach can be used without the need for additional regularization. A detailed discussion of this result can be found in Appendix C.

Section: DISCRETIZE-THEN-OPTIMIZE METHOD
Alternatively, we can discretize the ODE using Euler's method, treating the continuous ODE as a finite-depth Residual Network (ResNet) $f^L(x; \theta)$ with parameters shared across layers:
$$f^L(x; \theta) = \frac{\sigma_v}{\sqrt{n}}\, v^\top \phi(h_L(x)), \tag{10a}$$
$$h_\ell = h_{\ell-1} + \kappa \cdot \frac{\sigma_w}{\sqrt{n}}\, W \phi(h_{\ell-1}), \quad \forall \ell \in \{1, 2, \ldots, L\}, \tag{10b}$$
$$h_0 = \frac{\sigma_u}{\sqrt{d}}\, U x, \tag{10c}$$
where $\kappa = T/L$ is the time step. The gradient can then be computed by backpropagation through the finite-depth ResNet $f^L(x; \theta)$, which is referred to as the discretize-then-optimize approach.
As a finite-depth ResNet, the gradients of f L (x; θ) are always well defined. However, it remains an open question whether the gradients of the finite-depth approximation f L (x; θ) converge to the gradients of the continuous Neural ODE f (x; θ) as the depth L → ∞. In Proposition 2, we demonstrate that the smoothness of ϕ ensures this convergence. Thus, in the limit of infinite depth (or infinitesimally small time steps), both the optimize-then-discretize and discretize-then-optimize methods yield the same gradients, provided that the activation function is sufficiently smooth. The detailed proofs are provided in Appendix E.
Proposition 2. Given $x \in \mathbb{R}^d$, if the activation function $\phi$ and its derivative $\phi'$ are $L_1$- and $L_2$-Lipschitz continuous, respectively, the following inequalities hold almost surely over the random initialization:
$$\|\nabla_\theta f^L(x) - \nabla_\theta f(x)\| \le C L^{-1}, \quad \forall \ell \in \{0, 1, \ldots, L\}, \tag{11}$$
where $C > 0$ is a constant depending only on $L_1$, $L_2$, $T$, $\sigma_v$, $\sigma_w$, $\sigma_u$, and $\|x\|$.
To further validate our theoretical findings, we conduct experiments that compare training efficiency and gradient accuracy with and without the Lipschitz continuity of $\phi'$. These experimental results, illustrated in Figure 7, demonstrate the necessity of Lipschitz continuity for ensuring smooth gradient computation and achieving faster convergence during training.
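The $O(L^{-1})$ behaviour of the Euler ResNet in Eq. (10) can be observed with a small sketch. The code below is an illustration rather than the paper's experiment: it assumes a tanh activation, unit variance hyperparameters, and uses a very deep Euler network ($L = 20000$) as a stand-in for the continuous model, and it measures the error of the hidden state rather than of the gradient. The function name `resnet_hidden_state` is hypothetical.

```python
import numpy as np

def resnet_hidden_state(U, W, x, T, L):
    """Final hidden state h_L of the Euler ResNet in Eq. (10b)-(10c), tanh activation."""
    n, d = U.shape
    kappa = T / L                                   # time step
    h = U @ x / np.sqrt(d)                          # h_0 with sigma_u = 1
    for _ in range(L):
        h = h + kappa * (W @ np.tanh(h)) / np.sqrt(n)   # sigma_w = 1
    return h

rng = np.random.default_rng(0)
n, d, T = 32, 8, 1.0
U = rng.standard_normal((n, d))
W = rng.standard_normal((n, n))
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

# Very fine discretization used as a proxy for the continuous solution h_T.
h_ref = resnet_hidden_state(U, W, x, T, L=20000)

depths = [10, 20, 40, 80, 160]
errors = [
    np.linalg.norm(resnet_hidden_state(U, W, x, T, L) - h_ref) / np.sqrt(n)
    for L in depths
]
for L, err in zip(depths, errors):
    # For a first-order method, L * err is expected to stay roughly constant.
    print(f"L={L:4d}   error={err:.3e}   L*error={L * err:.3e}")
```

If the product `L * error` stays roughly constant as `L` doubles, the error is decaying at the first-order rate that Proposition 2 states for the gradients.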

Section: NNGP AND NTK FOR NEURAL ODES
Understanding how activation and gradient propagate through neural networks is crucial for analyzing their training dynamics and generalization abilities, as emphasized in previous studies (Glorot & Bengio, 2010;Poole et al., 2016;Schoenholz et al., 2017). The frameworks of Neural Network Gaussian Processes (NNGP) (Lee et al., 2018) and Neural Tangent Kernels (NTK) (Jacot et al., 2018), grounded in mean-field theory, provide powerful analytical tools to study these dynamics. In this section, we establish theoretical results that demonstrate the well-defined nature of NNGP and NTK for Neural ODEs and explore their properties with respect to information propagation.

Section: NNGP: FORWARD PROPAGATION OF INPUTS
Previous work has shown that, in the infinite-width limit, randomly initialized finite-depth neural networks converge to Gaussian processes with mean zero and recursively defined covariance functions, known as NNGP kernels (Neal, 2012; Lee et al., 2018; Daniely et al., 2016; Yang, 2019; Gao et al., 2023; Gao, 2024). When approximating Neural ODEs by a sequence of finite-depth ResNets $f^L_\theta$, we can establish the NNGP for $f^L_\theta$. Detailed proofs are provided in Appendix D.
Proposition 3. Suppose $\phi$ is $L_1$-Lipschitz continuous. Then, as the width $n \to \infty$, the finite-depth neural network $f^L_\theta$ defined in Eq. (10) converges in distribution to a centered Gaussian process with covariance function, or NNGP kernel, $\Sigma^{L+1} := C^{L+1,L+1}$ defined recursively:
$$C^{0,k}(x, \bar{x}) = \delta_{0,k}\, \frac{\sigma_u^2}{d}\, x^\top \bar{x}, \quad \forall k \in \{0, 1, \ldots, L+1\}, \tag{12}$$
$$C^{\ell,k}(x, \bar{x}) = \sigma_w^2\, \mathbb{E}[\phi(u_{\ell-1})\phi(\bar{u}_{k-1})], \quad \forall \ell, k \in \{1, 2, \ldots, L+1\}, \tag{13}$$
where $\kappa = T/L$ and $(u_\ell, \bar{u}_k)$ are centered Gaussian random variables with covariance
$$\mathbb{E}(u_\ell \bar{u}_k) = C^{0,0}(x, \bar{x}) + \kappa^2 \sum_{i,j=1}^{\ell,k} C^{i,j}(x, \bar{x}), \quad \forall \ell, k \in \{0, 1, \ldots, L\}. \tag{14}$$
Although Proposition 3 shows that a sequence of Gaussian processes (GPs) can be derived for the sequence $f^L_\theta$, this does not necessarily mean that the Neural ODE $f_\theta$ also converges to a Gaussian process as $L \to \infty$. The difficulty lies in the difference between two convergence patterns, infinite-width-then-depth and infinite-depth-then-width, which often lead to different limits. For example, consider the simple double sequence $a_{n,\ell} := n/(\ell + n)$. Taking the limit first in width and then in depth, or vice versa, yields different results:
$$\lim_{\ell \to \infty} \Big( \lim_{n \to \infty} a_{n,\ell} \Big) = 1, \quad \text{while} \quad \lim_{n \to \infty} \Big( \lim_{\ell \to \infty} a_{n,\ell} \Big) = 0.$$
This phenomenon has been noted in several recent studies (Hayou & Yang, 2023; Yang et al., 2024; Gao et al., 2023; Gao, 2024) across various neural network architectures. Specifically, for commonly used neural networks, the two convergence patterns often do not coincide, leading to different limits for infinite-depth networks. Hence, the NNGP correspondence does not generally hold for infinite-depth neural networks. For Neural ODEs, the infinite-depth-then-width convergence pattern is more relevant, as Neural ODEs are equivalent to infinite-depth neural networks from the standpoint of numerical discretization. However, the continuous nature and parameter sharing of Neural ODEs present unique challenges, so previous mathematical tools are not directly applicable.
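The order-of-limits example $a_{n,\ell} = n/(\ell + n)$ can be checked numerically. The snippet below is a small illustration in which a very large index stands in for the inner limit; the function name `a` and the chosen index values are arbitrary.

```python
def a(n: int, ell: int) -> float:
    """The double sequence a_{n, ell} = n / (ell + n)."""
    return n / (ell + n)

big = 10**12  # stand-in for "index tending to infinity"

# Width first: for each fixed depth ell, a(n, ell) approaches 1 as n grows.
width_first = [a(big, ell) for ell in (1, 10, 100)]

# Depth first: for each fixed width n, a(n, ell) approaches 0 as ell grows.
depth_first = [a(n, big) for n in (1, 10, 100)]

# Along the diagonal n = ell the value is constant, neither 0 nor 1.
diagonal = a(1000, 1000)

print("width-first inner limits:", width_first)
print("depth-first inner limits:", depth_first)
print("diagonal value:", diagonal)
```

The diagonal value shows that the sequence does not converge uniformly in either index, which is exactly the situation in which the two iterated limits are allowed to differ.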
Fortunately, if the activation function $\phi$ is sufficiently smooth, we can show that these two limits commute, and both convergence patterns share the same limit. One crucial intermediate result is that the double sequence $\langle \phi(h_L), \phi(\bar{h}_L) \rangle / n$ converges in depth $L$ uniformly in width $n$ (almost surely). This uniform convergence ensures that, as depth increases, the behavior of the system remains stable regardless of width, which is what allows the limits to commute and establishes the well-posedness of the NNGP for Neural ODEs. The proof relies on the convergence theorem for Euler's method and is provided in Appendix D.
Lemma 1. Let $\phi$ be $L_1$-Lipschitz continuous. For any $x, \bar{x} \in \mathbb{S}^{d-1}$, the double sequence $\langle \phi(h_L), \phi(\bar{h}_L) \rangle / n$ satisfies
$$\big| \langle \phi(h_L), \phi(\bar{h}_L) \rangle - \langle \phi(h_T), \phi(\bar{h}_T) \rangle \big| / n \le C L^{-1}, \tag{15}$$
where $C > 0$ is a constant depending solely on $L_1$, $\sigma_w$, $\sigma_u$, and $T$.
By combining Lemma 1 and Proposition 3 with the Moore-Osgood theorem, stated as Theorem 8 in Appendix A, we can establish the NNGP correspondence for the Neural ODE $f_\theta$.
Theorem 1. Suppose $\phi$ is $L_1$-Lipschitz continuous. As the width $n \to \infty$, the Neural ODE $f_\theta$ defined in Eq. (1) converges in distribution to a centered Gaussian process with covariance function $\Sigma^*$, defined as the limit of $\Sigma^L$ given in Proposition 3.
Remark 1. Thanks to the uniform convergence result established in Lemma 1, the covariance function $\Sigma^L$ converges to $\Sigma^*$ at the rate $|\Sigma^L(x, \bar{x}) - \Sigma^*(x, \bar{x})| \sim C L^{-1}$. This polynomial rate of convergence preserves the geometry of the input space (Yang & Schoenholz, 2017). This stands in contrast to classical feedforward networks, where the input space geometry often collapses unless the variance hyperparameters are set precisely on the edge of chaos (Poole et al., 2016).

Section: NTK: BACKPROPAGATION OF GRADIENTS
While the NNGP governs the forward propagation of inputs, the NTK (Jacot et al., 2018) governs the backward propagation of gradients. Understanding both is key to comprehending the full dynamics of Neural ODEs during training. As defined for Neural ODEs in Eq. (9), we can also define the NTK $K^L_\theta$ for the finite-depth network $f^L_\theta$ in Eq. (10) as follows:
$$K^L(x, \bar{x}; \theta) := \langle \nabla_\theta f^L(x; \theta), \nabla_\theta f^L(\bar{x}; \theta) \rangle. \tag{16}$$
In the same infinite-width limit, as highlighted in previous works (Jacot et al., 2018; Yang, 2020), the NTK $K^L_\theta$ converges to a deterministic kernel $K^L_\infty$ that remains constant throughout training. Notably, this deterministic limiting NTK $K^L_\infty$ (and $K_\infty$ defined in Theorem 2) governs the training dynamics of Neural ODEs under gradient descent.
Below are the results for our setup, with proofs provided in Appendix E.
Proposition 4. Suppose $\phi$ is $L_1$-Lipschitz continuous. Then, as the network width $n \to \infty$, the NTK $K^L_\theta$ converges almost surely, for every depth $L$, to a deterministic limiting kernel:
$$K^L_\infty(x, \bar{x}) = C^{L+1,L+1}(x, \bar{x}) + \kappa^2 \sum_{\ell,k=1}^{L} C^{\ell,k}(x, \bar{x})\, D^{\ell,k}(x, \bar{x}) + C^{0,0}(x, \bar{x})\, D^{0,0}(x, \bar{x}), \tag{17}$$
where $C^{\ell,k}$ are defined in Proposition 3 and $D^{\ell,k}$ are defined recursively:
$$D^{L,k}(x, \bar{x}) = \sigma_w^2\, \mathbb{E}[\phi'(u_L)\phi'(\bar{u}_L)], \quad \forall k \in \{0, 1, \ldots, L\}, \tag{18}$$
$$D^{\ell,k}(x, \bar{x}) = \kappa^2 \sum_{i,j=L}^{\ell+1,k+1} D^{i,j}(x, \bar{x})\, \mathbb{E}[\phi'(u_i)\phi'(\bar{u}_j)], \quad \forall \ell, k \in \{1, 2, \ldots, L-1\}. \tag{19}$$
The same problem of different convergence patterns converging to different limits, observed for the NNGP kernel $\Sigma^*$, also arises when computing the NTK of Neural ODEs. While the Lipschitz continuity of $\phi$ enables well-posed forward propagation of inputs in Neural ODEs, additional regularity is required for backward propagation of gradients. Specifically, Lipschitz continuity of $\phi'$ is sufficient to ensure uniform convergence of the NTK. With $\phi$ and $\phi'$ both Lipschitz continuous, we obtain a uniform convergence result similar to Lemma 1.
Lemma 2. If $\phi$ is $L_1$-Lipschitz continuous and $\phi'$ is $L_2$-Lipschitz continuous, then the following inequality holds almost surely:
$$\big| K^L_\theta(x, \bar{x}) - K_\theta(x, \bar{x}) \big| \le C L^{-1}, \quad \forall x, \bar{x} \in \mathbb{S}^{d-1}, \tag{20}$$
where $C > 0$ is a constant depending only on $\sigma_v$, $\sigma_w$, $\sigma_u$, $L_1$, $L_2$, and $T$.
Combining Lemma 2 with Proposition 4 and the Moore-Osgood theorem (Theorem 8), we can interchange the limits in $L$ and $n$ for the double sequence $K^L_\theta(x, \bar{x})$ and show that the NTK $K_\theta$ of the Neural ODE converges to a deterministic limiting kernel.
Theorem 2. Suppose $\phi$ is $L_1$-Lipschitz continuous and $\phi'$ is $L_2$-Lipschitz continuous. As the network width $n \to \infty$, the NTK $K_\theta$ converges almost surely to a deterministic limiting kernel:
$$K_\theta \to K_\infty, \quad \text{as } n \to \infty, \tag{21}$$
where $K_\infty$ is the limit of the NTK $K^L_\infty$ defined in Proposition 4, as depth $L \to \infty$.
Remark 2. Using the uniform convergence from Lemma 2, we observe that $\|K^L_\infty(x, \bar{x}) - K_\infty(x, \bar{x})\| \sim C L^{-1}$. This polynomial convergence not only guarantees that gradients neither explode nor vanish as $L \to \infty$ (Yang & Schoenholz, 2017; Schoenholz et al., 2017), but also implies that the limiting NTK $K_\infty$ has an integral form, as suggested by Eq. (17) and provided in Appendix E.5. This integral form provides a key insight for studying the spectral properties of the NTK $K_\infty$ directly, without relying on the inductive techniques used in previous works.
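As an illustration of the finite-width, finite-depth kernel $K^L_\theta$ in Eq. (16), the sketch below assembles an empirical NTK Gram matrix from explicit gradients of the Euler ResNet. It is a hedged example and not the authors' code: it assumes a tanh activation, unit variance hyperparameters, and a small random dataset on the unit sphere. The names `grad_theta` and `empirical_ntk` are hypothetical.

```python
import numpy as np

def dphi(h):
    """Derivative of tanh (assumed activation)."""
    return 1.0 - np.tanh(h) ** 2

def grad_theta(U, W, v, x, T, L):
    """Flattened gradient of the Euler ResNet output w.r.t. theta = (U, W, v)."""
    n, d = U.shape
    kappa = T / L
    hs = [U @ x / np.sqrt(d)]
    for _ in range(L):
        hs.append(hs[-1] + kappa * (W @ np.tanh(hs[-1])) / np.sqrt(n))
    grad_v = np.tanh(hs[-1]) / np.sqrt(n)
    lam = dphi(hs[-1]) * v / np.sqrt(n)            # gradient w.r.t. h_L
    grad_W = np.zeros_like(W)
    for ell in range(L, 0, -1):                    # backpropagate through the layers
        grad_W += kappa * np.outer(lam, np.tanh(hs[ell - 1])) / np.sqrt(n)
        lam = lam + kappa * dphi(hs[ell - 1]) * (W.T @ lam) / np.sqrt(n)
    grad_U = np.outer(lam, x) / np.sqrt(d)
    return np.concatenate([grad_U.ravel(), grad_W.ravel(), grad_v])

def empirical_ntk(U, W, v, X, T, L):
    """Gram matrix with entries <grad f(x_i), grad f(x_j)>, as in Eq. (16)."""
    G = np.stack([grad_theta(U, W, v, x, T, L) for x in X])
    return G @ G.T

rng = np.random.default_rng(0)
n, d, T, N = 64, 8, 1.0, 4
U = rng.standard_normal((n, d))
W = rng.standard_normal((n, n))
v = rng.standard_normal(n)
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # inputs on the unit sphere

H_coarse = empirical_ntk(U, W, v, X, T, L=20)
H_fine = empirical_ntk(U, W, v, X, T, L=200)
eigs = np.linalg.eigvalsh(H_fine)

print("max |H_coarse - H_fine|:", np.abs(H_coarse - H_fine).max())
print("eigenvalues of H_fine:", eigs)
```

The gap between `H_coarse` and `H_fine` gives a rough feel for the depth dependence bounded in Lemma 2, and the smallest eigenvalue of the Gram matrix is the quantity whose positivity the convergence analysis relies on.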

Section: GLOBAL CONVERGENCE ANALYSIS FOR NEURAL ODES
As discussed in Eq. (8), the dynamics of the residual $u_k - y$ under gradient descent can be characterized using the NTK $K_\theta$. In the infinite-width limit, as shown in Theorem 2, this time-varying kernel $K_\theta$ converges to a deterministic limiting kernel $K_\infty$, provided the activation function $\phi$ is sufficiently smooth. Therefore, in this section, we establish the global convergence of Neural ODEs under gradient descent by examining the spectral properties of the NTK $K_\theta$ and its limit $K_\infty$.
The limiting NTK $K_\infty$ is a deterministic kernel function, and its spectral properties are key to understanding global convergence. Previous studies (Jacot et al., 2018; Nguyen, 2021) have highlighted that the strict positive definiteness (SPD) of the NNGP kernel $\Sigma^*$ is sufficient to guarantee the SPD property of $K_\infty$. Since $\Sigma^*$ is a component of $K_\infty$ defined in Eq. (9), demonstrating the SPD property of $\Sigma^*$ is critical for proving convergence.
However, prior analyses have relied on inductive proofs for finite-depth neural networks, which are not directly applicable to infinite-depth and continuous networks like Neural ODEs. That is because, as depth increases, information propagation can become trivial (i.e., gradients vanishing or exploding), potentially diminishing the SPD property at the infinite-depth limit (Poole et al., 2016;Schoenholz et al., 2017;Hayou & Yang, 2023). Fortunately, results in Section 4 demonstrated stable information propagation in both forward and backward directions, regardless of the choice of σ v , σ w , and σ u . This allows us to retain the SPD property of the NTK of Neural ODEs as the depth L approaches infinity. Specifically, recall from Theorem 1 that we can express Σ * as:
$$\Sigma^*(x, \bar{x}) = \mathbb{E}[\phi(u)\phi(\bar{u})],$$
where $(u, \bar{u})$ are centered Gaussian random variables with covariance $S^*(x, \bar{x})$ defined by
$$S^*(x, \bar{x}) = \lim_{L \to \infty} \Big[ C^{0,0}(x, \bar{x}) + \kappa^2 \sum_{\ell,k=1}^{L} C^{\ell,k}(x, \bar{x}) \Big]$$
with $\kappa = T/L$. The expression $S^*$ can be interpreted as a double integral, whose explicit form is given in Appendix E.5. By leveraging the results from Section 4, we can derive key properties of $S^*$ in Lemma 3. These properties serve as a basis for analyzing the SPD properties of the NNGP and NTK.
Lemma 3. For any $x, \bar{x} \in \mathbb{S}^{d-1}$, we have
1. $S^*(x, x)$ is well defined, and $0 < S^*(x, x) = S^*(\bar{x}, \bar{x}) < \infty$,
2. $S^*(x, x) \ge S^*(x, \bar{x})$, and equality holds if and only if $x = \bar{x}$.
Lemma 3 implies $S^*(x, x) = \Theta(1)$ for all $x \in \mathbb{S}^{d-1}$. This allows us to study the SPD property of the NNGP kernel $\Sigma^*$ using its Hermite expansion from the perspective of dual activation (Daniely et al., 2016). Detailed analysis and proofs are provided in Appendix F. Additionally, $S^*(x, x) > S^*(x, \bar{x})$ for all $x \ne \bar{x}$ implies that the pathology known as the loss of input dependence, observed in other large-depth networks such as feedforward networks (Poole et al., 2016), ResNets (Hayou & Yang, 2023), and RNNs (Gao et al., 2023), does not occur in Neural ODEs.
Assumption 1.
1. Data: $x_i \in \mathbb{S}^{d-1}$ and $x_i \ne x_j$ for all $i \ne j$; $|y_i| = O(1)$,
2. Smoothness: $\phi$ and $\phi'$ are $L_1$- and $L_2$-Lipschitz continuous, respectively,
3. Nonlinearity: $\phi$ is nonlinear and non-polynomial.
Under Assumption 1, we can employ inductive proofs to show that in the overparameterized regime, the parameters θ k remain close to their initialization θ 0 . This proximity ensures that the Neural ODE and its gradients are well-posed not only at initialization, as proved in Proposition 1, but also throughout the entire training. This consistency in parameter updates enables us to prove that the NTK K θ retains SPD during training, ensuring that the training errors of Neural ODEs consistently decrease to zero at a linear rate. Detailed analysis and proofs are provided in Appendix G. Theorem 3. Suppose Assumption 1 holds and the learning rate η is chosen such that 0 < η ≤ 1/∥X∥ 2 . Then for any δ > 0, there exists a natural number n δ such that for all widths n ≥ n δ the following results hold with probability at least 1 -δ over random initialization Eq. ( 3):
1. The parameters $\theta_k$ stay in a neighborhood of $\theta_0$, i.e.,
$$ \|\theta_k - \theta_0\| \le C \|X\| L(\theta_0)/\lambda_0. \tag{22} $$
2. The loss function $L(\theta_k)$ consistently decreases to zero at an exponential rate, i.e.,
$$ L(\theta_k) \le \left(1 - \frac{\eta \lambda_0}{16}\right)^k L(\theta_0), \tag{23} $$
where λ 0 := λ min (K ∞ ) > 0, and the constant C > 0 only depends on L 1 , L 2 , σ v , σ w , σ u , and T .
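To make the rate in equation 23 concrete, the following minimal sketch computes how many gradient-descent steps the bound needs before it guarantees a given relative loss reduction. The function names and the numerical values of $\eta$ and $\lambda_0$ are illustrative assumptions, not quantities reported in the paper.

```python
import math


def loss_bound(eta, lambda_0, k, initial_loss):
    """Right-hand side of equation 23: (1 - eta * lambda_0 / 16)^k * L(theta_0)."""
    return (1.0 - eta * lambda_0 / 16.0) ** k * initial_loss


def iterations_to_tolerance(eta, lambda_0, rel_tol):
    """Smallest k for which the bound guarantees L(theta_k) <= rel_tol * L(theta_0)."""
    rate = 1.0 - eta * lambda_0 / 16.0
    if not 0.0 < rate < 1.0:
        raise ValueError("need 0 < eta * lambda_0 < 16 for a contraction")
    return math.ceil(math.log(rel_tol) / math.log(rate))


# Hypothetical values chosen only for illustration.
eta, lambda_0 = 0.1, 0.5
k = iterations_to_tolerance(eta, lambda_0, rel_tol=1e-3)
print(k, loss_bound(eta, lambda_0, k, initial_loss=1.0))
```

Because the contraction factor is $1 - \eta\lambda_0/16$, the step count scales roughly like $16 \log(1/\varepsilon)/(\eta \lambda_0)$, so a small $\lambda_0$ translates directly into slow guaranteed convergence.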

Section: EXPERIMENTS
In this section, we validate our theoretical findings through several experiments on Neural ODEs. We focus on the approximation errors between the continuous Neural ODE and its finite-depth ResNet approximations, the NTK behavior, and the empirical convergence properties of Neural ODEs under gradient descent. Further experimental details, including additional experiments on smooth vs. nonsmooth activations and scaling for long-horizon stability, can be found in the appendix. decays as 1/L, where L is the depth of the ResNet. We empirically verify this by measuring the output and gradient differences between the continuous Neural ODE and its finite-depth approximation at initialization. Both the Neural ODE and ResNet were initialized with the same random weights and evaluated on the MNIST dataset, with ResNet depths L ranging from 10 to 1,000. We used Softplus activation to ensure smoothness. Figure 1(a)-(b) demonstrates that the approximation error for both outputs and gradients decreases as 1/L, with convergence being uniform across different widths, consistent with our theoretical results. These findings confirm that smooth activation functions lead to well-posed ODE solutions, with accurate approximations by finite-depth networks.
NTK Approximation Error from Finite-Depth ResNet. As discussed in Section 4, the NTK of Neural ODEs can be approximated by the NTK of a finite-depth ResNet, with the approximation error decaying as 1/L. This follows from the fact that the NTK is the inner product of gradients, and as shown in Proposition 2 and Figure 1(a)-(b), the gradient difference between Neural ODEs and finite-depth ResNets also decays as 1/L. By applying the triangle inequality to the gradient differences, it is straightforward to conclude that the NTK approximation error inherits the same 1/L decay rate. Given this reasoning and the page limit, we skip this experiment, but refer readers to Section 4 and Theorem 3 for a detailed theoretical analysis.
NTK Convergence to Deterministic Limiting NTK. In Theorem 2, we prove that as the width of Neural ODEs tends to infinity, the NTK converges to a deterministic limiting NTK. While no theoretical convergence rate is provided, we conducted experiments to empirically investigate this convergence. We evaluated Neural ODE models with increasing widths, ranging from 10 to 1,000, and computed the NTK for each width. These NTKs were then compared to an approximate limiting NTK derived from random matrix theory. As shown in Figure 1(c)-(d), the NTK converges to the limiting NTK as the width increases. The empirical convergence rate falls between 1/m and 1/√m, with a tendency closer to 1/√m when plotted on a logarithmic scale. This indicates that Neural ODEs exhibit rapid convergence to their limiting NTK, validating the theoretical analysis.
NTK's SPD and Global Convergence. In Proposition 5 and Corollary 1, we established that the NTK of Neural ODEs is SPD when the activation function is nonlinear but not polynomial, which guarantees global convergence under gradient descent. Specifically, the NTK's smallest eigenvalue remains positive, ensuring the well-conditioning of the model during training. Additionally, we showed that the model parameters remain close to their initial values during training, further supporting the global convergence claim.
To empirically verify these results, we conducted experiments with Neural ODE models of varying widths (500, 1000, 2000, and 4000) while monitoring both the NTK's smallest eigenvalue and the distance of the model parameters from their initial values over 100 epochs. Softplus was used as the activation function to ensure smoothness and non-polynomial nonlinearity. At each epoch, we computed the smallest eigenvalue of the NTK and the Euclidean distance between the current and initial parameter values.
The Smallest Eigenvalue of NTK: As shown in Figure 2(a), we observed that as the width of the Neural ODE increases, the smallest eigenvalue of the NTK becomes larger. For widths greater than the number of training samples (1000 in our experiments), the smallest eigenvalue remains strictly positive throughout the training process, confirming the NTK's strict positive definiteness and ensuring that the model is well-conditioned for gradient descent. However, for widths smaller than the number of training samples, the smallest eigenvalue becomes negative, indicating poor conditioning at smaller widths.
Parameter Distance: The results also confirm that the parameter distance remains stable as training progresses, staying within a manageable bound of O(1), as shown in Figure 2(b). As the width increases, the parameter distance grows, but the growth remains stable and does not deviate significantly. This supports the theoretical result that the parameters do not stray far from their initialization, ensuring stable training and global convergence.
Additional Experimental Results. In the appendix, we present supplementary experiments that validate and extend our findings. Without proper scaling (e.g., σ_w ∼ 1/T), Neural ODEs exhibit early-stage damping during training over long-time horizons (see Figure 3). Smooth activations like Softplus converge faster than non-smooth ones like ReLU, likely due to more accurate gradient computation (see Figure 7). Additionally, while non-polynomial nonlinearity is sufficient for an SPD NTK, our experiments show that quadratic activations also yield SPD NTKs, though with slower convergence (see Figure 8). These results highlight the importance of activation functions and model design for Neural ODE performance. We also include convergence analysis on diverse datasets, such as CIFAR-10, AG News, and Daily Climate, as well as additional activations like GELU, further demonstrating the generalizability of our findings.

Section: CONCLUSIONS
In this paper, we examined the crucial role of activation functions in the training dynamics of Neural ODEs. Our findings demonstrate that the choice of activation function significantly impacts the dynamics, stability, and global convergence of Neural ODE models under gradient descent. Specifically, we found that using smooth activations like Softplus ensures that the forward and backward dynamics in Neural ODEs are well-posed, allowing for accurate approximation by finite-depth ResNets. As a result, the NTK of Neural ODEs converges to a deterministic limiting NTK that governs the model's training dynamics. Additionally, we demonstrated that when using nonlinear, non-polynomial activations, the NTK remains SPD, ensuring well-conditioned training and global convergence. Through extensive experiments, we verified that, with suitable activation functions, Neural ODEs exhibit stable parameter behavior, rapid NTK convergence, and faster optimization, particularly at larger widths. These findings highlight the importance of selecting activation functions with appropriate smoothness and nonlinearity to ensure the robustness and scalability of Neural ODEs, establishing them as a powerful approach for continuous-time deep learning.

Section: A USEFUL MATHEMATICAL RESULTS
Theorem 4 (Bai-Yin law, see Vershynin (2010); Bai & Yin (2008)). Let A be an N × n random matrix whose entries are independent copies of a random variable with zero mean, unit variance, and finite fourth moment. Suppose that N and n grow to infinity while the aspect ratio n/N converges to a constant in [0, 1]. Then
$$ s_{\min}(A) = \sqrt{N} - \sqrt{n} + o(\sqrt{n}), \qquad s_{\max}(A) = \sqrt{N} + \sqrt{n} + o(\sqrt{n}), \quad \text{almost surely.} \tag{24} $$
Theorem 5 (Picard-Lindelöf theorem). Let $f : [a, b] \times \mathbb{R}^n \to \mathbb{R}^n$ be a function. If $f$ is continuous in the first argument and Lipschitz continuous with coefficient $L$ in the second argument, then the ODE
$$ \dot{x}(t) = f(t, x(t)) \tag{25} $$
possesses a unique solution on $[a - \varepsilon, a + \varepsilon]$ for each possible initial value $x(a) = x_0 \in \mathbb{R}^n$, where $\varepsilon < 1/L$.
Theorem 6 (Peano Existence Theorem). If the function $f$ is continuous in a neighborhood of $(t_0, x_0)$, then the ODE in equation 25 has at least one solution defined in a neighborhood of $t_0$.
Theorem 7 (Convergence for Euler's Method). Let $x_n$ be the result of applying Euler's method to the ordinary differential equation defined as follows
$$ \dot{x} = f(x, t), \quad t \in [t_0, t_1], \quad \text{and} \quad x(0) = x_0. \tag{26} $$
If the solution $x$ has a bounded second derivative and $f$ is $L$-Lipschitz continuous in $x$, then the global truncation error is bounded by
$$ \|x(t_n) - x_n\| \le \frac{hM}{2L}\left(e^{L(t_n - t_0)} - 1\right), \tag{27} $$
where $h$ is the time step, and $M$ is an upper bound on the second derivative of $x$ on the given interval.
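The bound in equation 27 can be checked numerically on a toy problem. The sketch below uses the scalar test equation $\dot{x} = x$ on $[0, 1]$ with $x(0) = 1$, which is an assumption made purely for illustration: here the Lipschitz constant is $L = 1$ and the second derivative $e^t$ is bounded by $M = e$. The helper names are hypothetical.

```python
import math


def euler(f, x0, t0, t1, num_steps):
    """Explicit Euler method for dx/dt = f(x, t); returns the value at t1."""
    h = (t1 - t0) / num_steps
    x, t = float(x0), float(t0)
    for _ in range(num_steps):
        x = x + h * f(x, t)
        t = t + h
    return x


def euler_error_bound(h, second_derivative_bound, lipschitz, elapsed):
    """Right-hand side of equation 27: h M / (2 L) * (exp(L (t_n - t_0)) - 1)."""
    return h * second_derivative_bound / (2.0 * lipschitz) * (math.exp(lipschitz * elapsed) - 1.0)


# Toy problem (assumed for illustration): dx/dt = x, x(0) = 1, exact solution exp(t).
for num_steps in (10, 100, 1000):
    h = 1.0 / num_steps
    error = abs(math.e - euler(lambda x, t: x, 1.0, 0.0, 1.0, num_steps))
    bound = euler_error_bound(h, second_derivative_bound=math.e, lipschitz=1.0, elapsed=1.0)
    print(f"steps={num_steps:5d}  error={error:.3e}  bound={bound:.3e}")
```

On this example the observed error shrinks by roughly a factor of ten each time the step count grows tenfold, which is the first-order behavior that the later $1/L$ depth rates rely on.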
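Lemma 4 (Gronwall's inequality). Let $I = [a, b]$ be an interval with $a < b < \infty$. Let $u, \alpha, \beta$ be real-valued continuous functions that satisfy the integral inequality
$$ u(t) \le \alpha(t) + \int_0^t \beta(s) u(s)\, ds, \quad \forall t \in I. \tag{28} $$
Then
$$ u(t) \le \alpha(t) + \int_0^t \alpha(s)\beta(s) \exp\left(\int_s^t \beta(r)\, dr\right) ds, \quad \forall t \in I. \tag{29} $$
If, in addition, $\alpha(t)$ is non-decreasing, then
$$ u(t) \le \alpha(t) \exp\left(\int_0^t \beta(s)\, ds\right), \quad \forall t \in I. $$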

Section: B DERIVATION OF GRADIENT THROUGH ADJOINT METHOD
In this section, we provide the detailed derivation of the adjoint method to compute the gradients. We first recall the forward ODE as follows:
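$$ \dot{h}_t = \frac{\sigma_w}{\sqrt{n}} W \phi(h_t), \qquad h_0 = \frac{\sigma_u}{\sqrt{d}} U x. $$
To compute the gradients, we introduce the Lagrange function
$$ L(\tilde{h}, \theta, \lambda, \mu) = f_\theta(x) + \int_0^T \lambda_t^\top \left[ \frac{\sigma_w}{\sqrt{n}} W \phi(\tilde{h}) - \dot{\tilde{h}} \right] dt + \mu^\top \left[ \frac{\sigma_u}{\sqrt{d}} U x - \tilde{h}(0) \right], $$
where $\tilde{h}$ is an extra variable that is independent of $\theta$, and $(\lambda, \mu)$ are Lagrange multipliers.
Observe that with $\tilde{h} = h$, we have
$$ L(h, \theta, \lambda, \mu) = f_\theta(x), \quad \forall (\lambda, \mu). $$
Thus, the derivative of $L$ w.r.t. $\theta$ is equal to the gradient of $f_\theta$ w.r.t. $\theta$.
Now, we consider a variation $(\delta h, \delta\theta)$ at the point $(h, \theta)$. Then the corresponding variation of $L$ is given by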
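$$
\begin{aligned}
\delta L(h, \theta, \lambda, \mu)
={}& \frac{\sigma_v}{\sqrt{n}} (\delta v)^\top \phi(h(T)) + \frac{\sigma_v}{\sqrt{n}} v^\top \operatorname{diag}(\phi'(h(T)))\, \delta h(T) + \mu^\top \left[ \frac{\sigma_u}{\sqrt{d}} (\delta U) x - \delta h(0) \right] \\
&+ \int_0^T \lambda^\top \left[ \frac{\sigma_w}{\sqrt{n}} (\delta W) \phi(h) + \frac{\sigma_w}{\sqrt{n}} W \operatorname{diag}(\phi'(h))\, \delta h - \delta \dot{h} \right] dt \\
={}& \frac{\sigma_v}{\sqrt{n}} (\delta v)^\top \phi(h(T)) + \frac{\sigma_v}{\sqrt{n}} v^\top \operatorname{diag}(\phi'(h(T)))\, \delta h(T) + \mu^\top \left[ \frac{\sigma_u}{\sqrt{d}} (\delta U) x - \delta h(0) \right] \\
&- \lambda^\top \delta h \Big|_0^T + \int_0^T \dot{\lambda}^\top \delta h\, dt + \int_0^T \lambda^\top \left[ \frac{\sigma_w}{\sqrt{n}} (\delta W) \phi(h) + \frac{\sigma_w}{\sqrt{n}} W \operatorname{diag}(\phi'(h))\, \delta h \right] dt \\
={}& \frac{\sigma_v}{\sqrt{n}} (\delta v)^\top \phi(h(T)) + \left[ \frac{\sigma_v}{\sqrt{n}} v^\top \operatorname{diag}(\phi'(h(T))) - \lambda(T)^\top \right] \delta h(T) + \mu^\top \frac{\sigma_u}{\sqrt{d}} (\delta U) x + (\lambda(0) - \mu)^\top \delta h(0) \\
&+ \int_0^T \left[ \dot{\lambda}^\top + \frac{\sigma_w}{\sqrt{n}} \lambda^\top W \operatorname{diag}(\phi'(h)) \right] \delta h\, dt + \int_0^T \frac{\sigma_w}{\sqrt{n}} \lambda^\top (\delta W) \phi(h)\, dt,
\end{aligned}
$$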
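where we use integration by parts in the second equality. Then we choose $(\lambda, \mu)$ such that $\mu = \lambda(0)$,
$$ \lambda(T) = \frac{\sigma_v}{\sqrt{n}} \operatorname{diag}(\phi'(h(T)))\, v, \qquad \dot{\lambda}(t) = -\frac{\sigma_w}{\sqrt{n}} \operatorname{diag}(\phi'(h(t)))\, W^\top \lambda(t). $$
Then the variation of $L$ becomes
$$ \delta L(h, \theta, \lambda, \mu) = \frac{\sigma_v}{\sqrt{n}} \phi(h(T))^\top \delta v + \frac{\sigma_u}{\sqrt{d}} \mu^\top (\delta U) x + \int_0^T \frac{\sigma_w}{\sqrt{n}} \lambda^\top (\delta W) \phi(h)\, dt. $$
Thus, we obtain the gradients of $f_\theta$ as
$$ \nabla_v f_\theta(x) = \frac{\sigma_v}{\sqrt{n}} \phi(h(T)), \qquad \nabla_W f_\theta(x) = \int_0^T \frac{\sigma_w}{\sqrt{n}} \lambda_t\, \phi(h_t)^\top dt, \qquad \nabla_U f_\theta(x) = \frac{\sigma_u}{\sqrt{d}} \lambda(0)\, x^\top. $$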
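The three gradient formulas above can be sanity-checked numerically. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: both ODEs are discretized with an explicit Euler scheme (chosen here because the resulting backward recursion coincides with exact backpropagation through the discretized forward pass, which makes a finite-difference check tight), the activation is Softplus, and all function names, sizes, and default values are hypothetical.

```python
import numpy as np


def softplus(z):
    return np.logaddexp(0.0, z)


def sigmoid(z):
    # Derivative of softplus, written via tanh for numerical stability.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def forward(U, W, v, x, T=1.0, L=200, s_u=1.0, s_w=1.0, s_v=1.0):
    """Euler discretization of the forward ODE; returns f and all hidden states."""
    n, d = U.shape
    beta = T / L
    hs = [s_u / np.sqrt(d) * (U @ x)]
    for _ in range(L):
        hs.append(hs[-1] + beta * s_w / np.sqrt(n) * (W @ softplus(hs[-1])))
    return s_v / np.sqrt(n) * (v @ softplus(hs[-1])), hs


def adjoint_gradients(U, W, v, x, T=1.0, L=200, s_u=1.0, s_w=1.0, s_v=1.0):
    """Gradients of f w.r.t. (v, W, U) from the adjoint state lambda."""
    n, d = U.shape
    beta = T / L
    f, hs = forward(U, W, v, x, T, L, s_u, s_w, s_v)
    lam = s_v / np.sqrt(n) * sigmoid(hs[-1]) * v          # terminal condition lambda(T)
    grad_W = np.zeros_like(W)
    for step in range(L - 1, -1, -1):                      # integrate backward in time
        grad_W += beta * s_w / np.sqrt(n) * np.outer(lam, softplus(hs[step]))
        lam = lam + beta * s_w / np.sqrt(n) * sigmoid(hs[step]) * (W.T @ lam)
    grad_v = s_v / np.sqrt(n) * softplus(hs[-1])
    grad_U = s_u / np.sqrt(d) * np.outer(lam, x)           # lam now approximates lambda(0)
    return f, grad_v, grad_W, grad_U
```

A directional finite-difference quotient of `forward` along a random perturbation of `W`, `U`, or `v` should then agree with the inner product of that perturbation with the corresponding adjoint gradient up to finite-difference error.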

Section: C WELL POSEDNESS OF NEURAL ODES AND ITS GRADIENTS
To show existence and uniqueness, we rely on the Picard-Lindelöf theorem, stated as Theorem 5 in Appendix A.

Section: C.1 FORWARD ODE IS WELL-POSED
As we assume the activation function is Lipschitz continuous, we can immediately obtain the local result that the hidden state h t exists near the initial time.
Lemma 5 (Local solution). If the activation function $\phi$ is $L_1$-Lipschitz continuous, then $h_t$ uniquely exists for all $|t| \le \varepsilon$, where $\varepsilon < 1/(\sigma_w L_1)$.
Proof. By the Bai-Yin law (Theorem 4), we know $\|W\| \sim \sqrt{n}$ a.s. Accordingly, we can show that the mapping $f : x \mapsto \frac{\sigma_w}{\sqrt{n}} W \phi(x)$ is Lipschitz continuous:
$$ \|f(x) - f(z)\| = \left\| \frac{\sigma_w}{\sqrt{n}} W \phi(x) - \frac{\sigma_w}{\sqrt{n}} W \phi(z) \right\| \le \sigma_w \|\phi(x) - \phi(z)\| \le \sigma_w L_1 \|x - z\|. $$
Hence $f$ is $\sigma_w L_1$-Lipschitz continuous a.s. As $t_0 = 0$, it follows from the Picard-Lindelöf theorem that a unique $h_t$ exists locally for all $|t| \le \varepsilon$, where $\varepsilon < 1/(\sigma_w L_1)$.
Lemma 6 (Global solution). For any given $T > 0$, if $\phi$ is $L_1$-Lipschitz continuous, then $h_t$ uniquely exists for all $|t| \le T$.
Proof. We have shown that a unique $h_t$ exists locally. Specifically, let $\phi_t(x)$ be the solution flow from the initial condition $x$ to the solution at time $t$. For any $h_0$, we choose $\varepsilon < 1/(\sigma_w L_1)$. Then the solution $h_1 := \phi_\varepsilon(h_0)$ is well-defined by the local solution result. As the dynamics are the same and the Lipschitz coefficient is uniform, $h_2 := \phi_\varepsilon(h_1)$ is also well-defined. By repeating this process for any finite number of steps $N$, we have that $h_N = \phi_\varepsilon(h_{N-1})$ is well-defined. Hence, as $T < \infty$, there exists $N$ such that $\varepsilon N \ge T$. Therefore, $h_t$ is well-defined for all $|t| \le T$ and the desired result is obtained.
The global-solution result then directly implies Proposition 1.

Section: C.2 BACKWARD ODE IS WELL POSED
Recall the backward ODE as follows
$$ \lambda_T = \frac{\sigma_v}{\sqrt{n}} \operatorname{diag}(\phi'(h_T))\, v, \tag{31} $$
$$ \dot{\lambda}_t = -\frac{\sigma_w}{\sqrt{n}} \operatorname{diag}(\phi'(h_t))\, W^\top \lambda_t. \tag{32} $$
Observe that if $h_t$ is well defined for $t \in [0, T]$, then the dynamics of $\lambda_t$ become linear. Hence, with a similar argument, we can easily show that the corresponding IVP for $\lambda_t$ is well posed.
Lemma 7. Given $T$, if the activation function $\phi$ is $L_1$-Lipschitz continuous, then $\lambda_t$ is uniquely determined for all $|t| \le T$ and $\lambda_t = \partial f_\theta / \partial h_t$ is the solution.
Proof. It follows from Lemmas 5 and 6 that $h_t$ is well defined for all $t \in [0, T]$ a.s. By Theorem 5, it suffices to show that $g : x \mapsto -\frac{\sigma_w}{\sqrt{n}} \operatorname{diag}[\phi'(h_t)] W^\top x$ is Lipschitz continuous:
$$ \|g(x) - g(z)\| = \left\| \frac{\sigma_w}{\sqrt{n}} \operatorname{diag}[\phi'(h_t)] W^\top (x - z) \right\| \le \sigma_w L_1 \|x - z\|, $$
where we use the fact $\|W\| \sim \sqrt{n}$ a.s. by Theorem 4 and $|\phi'| \le L_1$. Hence, the mapping $g$ is $\sigma_w L_1$-Lipschitz continuous. It follows from Theorem 5 that $\lambda_t$ uniquely exists for $t \in [T - \varepsilon, T + \varepsilon]$ with $\varepsilon < 1/(\sigma_w L_1)$. Then, with a similar argument, the local solution can be extended to a global solution since $\phi$ is uniformly Lipschitz continuous. Therefore, $\lambda_t$ is well defined for all $t \in [0, T]$.
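Additionally, we can show that $\lambda(t) = \frac{\partial f_\theta(x)}{\partial h(t)}$ is a solution. Specifically, the differential of $f_\theta$ is given by
$$ df_\theta = d\left[ v^\top \phi(h(T)) / \sqrt{n} \right] = \frac{1}{\sqrt{n}} v^\top \operatorname{diag}(\phi'(h(T)))\, dh(T). $$
Then we have
$$ \frac{\partial f_\theta(x)}{\partial h(T)} = \frac{1}{\sqrt{n}} \operatorname{diag}(\phi'(h(T)))\, v. \tag{33} $$
Moreover, for any $\varepsilon > 0$, it follows from the chain rule that
$$ \frac{\partial f_\theta(x)}{\partial h(t)} = \frac{\partial h(t + \varepsilon)}{\partial h(t)} \frac{\partial f_\theta(x)}{\partial h(t + \varepsilon)}, $$
where we have
$$ h(t + \varepsilon) = h(t) + \int_t^{t + \varepsilon} \frac{1}{\sqrt{n}} W \phi(h(s))\, ds. \tag{34} $$
Then we have
$$
\begin{aligned}
\frac{d}{dt} \frac{\partial f_\theta(x)}{\partial h(t)}
&= \lim_{\varepsilon \to 0^+} \frac{1}{\varepsilon} \left[ \frac{\partial f_\theta}{\partial h(t+\varepsilon)} - \frac{\partial f_\theta}{\partial h(t)} \right] \\
&= \lim_{\varepsilon \to 0^+} \frac{1}{\varepsilon} \left[ \frac{\partial f_\theta}{\partial h(t+\varepsilon)} - \frac{\partial h(t+\varepsilon)}{\partial h(t)} \frac{\partial f_\theta}{\partial h(t+\varepsilon)} \right] \\
&= \lim_{\varepsilon \to 0^+} \frac{1}{\varepsilon} \left[ \frac{\partial f_\theta}{\partial h(t+\varepsilon)} - \frac{\partial}{\partial h(t)} \left( h(t) + \frac{1}{\sqrt{n}} W \phi(h(t)) \varepsilon + O(\varepsilon^2) \right) \frac{\partial f_\theta}{\partial h(t+\varepsilon)} \right] \\
&= \lim_{\varepsilon \to 0^+} \frac{1}{\varepsilon} \left[ \frac{\partial f_\theta}{\partial h(t+\varepsilon)} - \left( I + \frac{\varepsilon}{\sqrt{n}} \operatorname{diag}(\phi'(h(t))) W^\top + O(\varepsilon^2) \right) \frac{\partial f_\theta}{\partial h(t+\varepsilon)} \right] \\
&= -\frac{1}{\sqrt{n}} \operatorname{diag}(\phi'(h(t)))\, W^\top \frac{\partial f_\theta}{\partial h(t)}.
\end{aligned}
$$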
Thus, $\partial f_\theta(x) / \partial h(t)$ satisfies the backward ODE.
We first recall the gradients of $f_\theta$ w.r.t. $\theta$ in a vectorized form:
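$$ \partial_v f_\theta(x) = \frac{\sigma_v}{\sqrt{n}} \phi(h(T)), \tag{35a} $$
$$ \partial_W f_\theta(x) = \int_0^T \frac{\sigma_w}{\sqrt{n}} \left( \phi(h_t) \otimes \lambda_t \right) dt, \tag{35b} $$
$$ \partial_U f_\theta(x) = \frac{\sigma_u}{\sqrt{d}} \left[ x \otimes \lambda(0) \right]. \tag{35c} $$
The corresponding augmented backward ODE is given by
$$ \begin{bmatrix} \dot{h}_t \\ \dot{\lambda}_t \\ \dot{g}_t \end{bmatrix} = \frac{\sigma_w}{\sqrt{n}} \begin{bmatrix} W \phi(h_t) \\ -\operatorname{diag}[\phi'(h_t)] W^\top \lambda_t \\ -\phi(h_t) \otimes \lambda_t \end{bmatrix}, \quad \forall t \in [0, T], \tag{36} $$
where $g_t \in \mathbb{R}^{n^2 \times 1}$ and the initial condition is $h_T$ and $\lambda_T$ combined with $g_T = 0$.
Once this augmented backward ODE is solved, the gradients of $f_\theta(x)$ w.r.t. $W$ can be obtained by
$$ \nabla_W f_\theta(x) = g(0) = g(T) + \int_T^0 \dot{g}_t\, dt = g(T) + \int_T^0 -\frac{\sigma_w}{\sqrt{n}} \phi(h_t) \otimes \lambda_t\, dt = \int_0^T \frac{\sigma_w}{\sqrt{n}} \left[ \phi(h_t) \otimes \lambda_t \right] dt, $$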
where we use the fact $g_T = 0$. Unlike in equation 4, where $h_t$ is known, $h_t$ is an unknown state in the augmented backward ODE equation 36. Hence, by Theorem 5, extra smoothness would generally be required to ensure well-posedness; for example, $\phi'$ may need to be Lipschitz continuous. However, the dynamics of $h_t$ are decoupled from the dynamics of $\lambda_t$ and $g_t$. Hence, one can solve for $h_t$ first (backward in time) and then solve the dynamical system for $\lambda_t$ and $g_t$. In this manner, $h_t$ is still a known state, and one can use the same regularity condition as in Proposition 2 to show the existence of unique solutions for $t \in [0, T]$. Therefore, no additional smoothness is needed to solve the augmented backward ODE.

Section: D NNGP CORRESPONDENCE FOR NEURAL ODES
In this section, we establish the NNGP correspondence for Neural ODEs. It follows from the Euler method that a Neural ODE can be approximated by the finite-depth neural network $f^L_\theta$ defined in equation 10. From the asymptotic perspective, a Neural ODE is equivalent to an infinite-depth ResNet with parameters shared across all of its hidden layers and a special depth-dependent scaling hyperparameter $T/L$.

Section: D.1 FINITE-DEPTH NEURAL NETWORKS AS GAUSSIAN PROCESSES
As the finite-depth neural network $f^L_\theta$ can be considered as an approximation to the Neural ODE $f_\theta$, we first study its signal propagation by establishing the NNGP correspondence for $f^L_\theta$. We define vectors $g^\ell \in \mathbb{R}^n$ by
$$ g^0(x) := \frac{\sigma_u}{\sqrt{d}} U x, \tag{37} $$
$$ g^\ell(x) := \frac{\sigma_w}{\sqrt{n}} W \phi(h^{\ell-1}), \quad \forall \ell \in \{1, 2, \cdots, L\}. \tag{38} $$
The vectors $g^\ell$ are G-vars in the Tensor program of Yang (2019). A Tensor program is a representation of neural network computations that involves only linear and element-wise nonlinear operations. Yang (2019) shows that, in the infinite-width limit, a computation using G-vars is equivalent to another computation that uses a corresponding list of one-dimensional Gaussian variables, as long as the computation involves only controllable nonlinear functions. The corresponding definitions and theorems are restated as follows.
Definition 1. (Yang, 2019, Simplified version of Definition 5.3) A real-valued function $\psi : \mathbb{R}^k \to \mathbb{R}$ is called controllable if there exist absolute constants $C, c > 0$ such that $|\psi(x)| \le C e^{c \sum_{i=1}^k |x_i|}$.
Theorem 9. (Yang, 2019, Theorem 5.4) Consider a NETSOR program that has forward computation for a given finite-depth neural network. Suppose the Gaussian random initialization and controllable activation functions for the given neural network. For any controllable $\psi : \mathbb{R}^M \to \mathbb{R}$, as width $n \to \infty$, any finite collection of G-vars g α with size M satisfies
$$ \frac{1}{n} \sum_{\alpha=1}^n \psi(g^0_\alpha, \ldots, g^M_\alpha) \xrightarrow{\text{a.s.}} \mathbb{E}\, \psi(z^0, \cdots, z^M), \tag{39} $$
where $\{z^0, \cdots, z^M\}$ are Gaussian random variables whose mean and covariance are computed by the corresponding NETSOR program.
Notably, controllable functions are not necessarily smooth, although smooth functions can be easily shown to be controllable. Moreover, controllable functions, as defined in (Yang, 2019, Definition 5.3), can grow faster than exponential but remain $L^1$- and $L^2$-integrable with respect to the Gaussian measure. However, the simplified definition presented here encompasses most functions encountered in practice. Moreover, the vectors $g^\ell$, or G-vars, need not encode the same input $x$. Hence, $g^\ell(x)$ and $g^\ell(\bar{x})$ are two different G-vars in the Tensor program. However, Theorem 9 still holds for any finite collection of G-vars, even if they encode different inputs. Therefore, by utilizing Theorem 9, we can show that as $n \to \infty$, the finite-depth network $f^L_\theta$ tends weakly to a Gaussian process. The result is stated in Proposition 3, and the associated Tensor program for $f^L_\theta$ is provided in Algorithm 1.
Published as a conference paper at ICLR 2025
Algorithm 1 ResNet f^L_θ Forward Computation on Input x
Input: U x/√d : G(n)
Input: W : A(n, n)
Input: v : G(n)
1: h^0 := U x/√d : G(n)
2: for ℓ ∈ [L] do
3:   x^{ℓ-1} := ϕ(h^{ℓ-1}) : H(n)
4:   g^ℓ := W x^{ℓ-1}/√n : G(n)
5:   h^ℓ := h^{ℓ-1} + κ · g^ℓ : G(n)
6: end for
7: x^L = ϕ(h^L) : H(n)
Output: v^T x^L/√n
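Algorithm 1 translates almost line by line into NumPy. The sketch below is an illustration under assumptions that are not taken from the paper: Softplus activation, unit variance scales, standard Gaussian weights, small hypothetical sizes, and a very deep network (20000 layers) used as a stand-in for the exact ODE solution. It then compares the final hidden state at moderate depths against that reference.

```python
import numpy as np


def softplus(z):
    return np.logaddexp(0.0, z)


def resnet_forward(U, W, v, x, T=1.0, L=100):
    """Algorithm 1 with kappa = T / L; returns the output and the last hidden state."""
    n, d = U.shape
    kappa = T / L
    h = U @ x / np.sqrt(d)                      # h^0
    for _ in range(L):
        g = W @ softplus(h) / np.sqrt(n)        # g^l from x^{l-1} = phi(h^{l-1})
        h = h + kappa * g                       # h^l = h^{l-1} + kappa * g^l
    return v @ softplus(h) / np.sqrt(n), h


rng = np.random.default_rng(0)
n, d = 64, 5                                    # hypothetical width and input dimension
U = rng.standard_normal((n, d))
W = rng.standard_normal((n, n))
v = rng.standard_normal(n)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                          # input on the unit sphere

_, h_ref = resnet_forward(U, W, v, x, L=20000)  # proxy for the continuous-depth limit
errors = {L: np.linalg.norm(resnet_forward(U, W, v, x, L=L)[1] - h_ref) for L in (50, 100, 200)}
for L, err in errors.items():
    print(f"L={L:4d}  ||h^L - h_ref|| = {err:.4e}")
```

If the depth-dependent scaling behaves as the Euler analysis suggests, doubling the depth should roughly halve the printed error.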
In the rest of this subsection, we provide a rigorous proof of the NNGP correspondence for $f^L_\theta$ by induction. For simplicity, the proof assumes that only one input $x$ is given; the result for multiple inputs is similar. Additionally, we assume $\sigma_v = \sigma_w = \sigma_u = 1$, since their values are not significant in the proof as long as they are strictly positive.
BASIC CASE L = 0
As $L = 0$, we have $f^0_\theta(x) = v^\top \phi(h^0) / \sqrt{n}$. Hence, there are no hidden layers. Based on the random initialization equation 3, we have
$$ g^0_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma^0(x, x)). $$
Let $\mathcal{B}_0$ be the smallest σ-algebra generated by $g^0$. By conditioning on $\mathcal{B}_0$, we have
$$ f^0_\theta \mid \mathcal{B}_0 \sim \mathcal{N}(0, \|\phi^0\|^2 / n), $$
where $\phi^0 := \phi(h^0)$. It follows from the law of large numbers that
$$ \sigma_v^2 \|\phi^0\|^2 / n = \frac{\sigma_v^2}{n} \sum_{k=1}^n \phi(h^0_k)^2 = \frac{\sigma_v^2}{n} \sum_{k=1}^n \phi(g^0_k)^2 \xrightarrow{\text{a.s.}} \mathbb{E}\, \phi(z^0)^2 := \Sigma^1(x, x), $$
where $z^0 \sim \mathcal{N}(0, \Sigma^0(x, x))$. As the limit is deterministic, the conditional and unconditional distributions converge to the same limit. Therefore, we have
$$ f^0_\theta \to \mathcal{GP}(0, \Sigma^1), \quad \text{where } \Sigma^1(x, \bar{x}) = \mathbb{E}_{z^0 \sim \Sigma^0}\, \phi(z^0(x))\, \phi(z^0(\bar{x})), $$
where we use $z^0 \sim \Sigma^0$ to denote centered Gaussian random variable(s) whose (co)variances can be computed using the covariance function $\Sigma^0$.
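GENERAL CASE L
Now consider $f^L_\theta(x) = v^\top \phi(h^L) / \sqrt{n}$. Here we have $h^L = h^{L-1} + \beta g^L$ and $g^L = W \phi(h^{L-1})$, where $\beta := T/L$. As $W$ has been used before, let $\mathcal{B}_{L-1}$ be the smallest σ-algebra generated by $\{g^0, \cdots, g^{L-1}\}$. Then we have
$$ g^\ell = W \phi(h^{\ell-1}), \quad \forall \ell \in \{1, 2, \cdots, L-1\}, \quad \text{or equivalently} \quad \underbrace{\begin{bmatrix} g^1 & \cdots & g^{L-1} \end{bmatrix}}_{:= G} = W \underbrace{\begin{bmatrix} \phi^0 & \cdots & \phi^{L-2} \end{bmatrix}}_{:= \Phi}, $$
where $G \in \mathbb{R}^{n \times (L-1)}$ and $\Phi \in \mathbb{R}^{n \times (L-1)}$.
We can obtain the conditional distribution of $W$ by solving the following optimization problem
$$ \min_W \frac{1}{2} \|W\|_F^2, \quad \text{s.t. } G = W \Phi. $$
The Lagrange function is given by
$$ L(W, V) = \frac{1}{2} \|W\|_F^2 + \langle V, G - W \Phi \rangle. $$
Then
$$ \nabla_W L(W, V) = W - V \Phi^T = 0 \implies W^* = V \Phi^T. $$
As $G = W \Phi$, we have
$$ G = W \Phi = V \Phi^T \Phi \implies V = G (\Phi^T \Phi)^\dagger \implies W^* = G (\Phi^T \Phi)^\dagger \Phi^T. $$
Thus, we have
$$ W \mid \mathcal{B} = W^* + \widetilde{W} \Pi^T = G (\Phi^T \Phi)^\dagger \Phi^T + \widetilde{W} \left( I_n - \Phi \Phi^\dagger \right), $$
where $\Pi = I_n - \Phi \Phi^\dagger$, $\widetilde{W}$ is an i.i.d. copy of $W$, and $\Phi^\dagger = (\Phi^T \Phi)^\dagger \Phi^T$.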
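The conditioning step above can be illustrated numerically. The following sketch assumes small hypothetical sizes and uses a random matrix as a stand-in for the activations $\Phi$; it checks that $W^* = G(\Phi^T\Phi)^\dagger\Phi^T$ satisfies the constraint $G = W\Phi$, and that adding an independent copy of $W$ projected by $\Pi$ leaves the constraint intact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 4                                   # width n and number of columns m (hypothetical)

W = rng.standard_normal((n, n))                # the weight matrix being conditioned on
Phi = rng.standard_normal((n, m))              # stand-in for [phi^0, ..., phi^{L-2}]
G = W @ Phi                                    # observed G-vars, so that G = W Phi holds

Phi_pinv = np.linalg.pinv(Phi)                 # Moore-Penrose inverse, (Phi^T Phi)^+ Phi^T
W_star = G @ Phi_pinv                          # minimum-Frobenius-norm solution of G = W Phi
Pi = np.eye(n) - Phi @ Phi_pinv                # projector onto the complement of col(Phi)

W_tilde = rng.standard_normal((n, n))          # independent copy of W
W_cond = W_star + W_tilde @ Pi                 # one draw from the conditional law of W

print(np.allclose(W_star @ Phi, G), np.allclose(W_cond @ Phi, G))
print(np.linalg.norm(W_star), np.linalg.norm(W))
```

The second printed line compares Frobenius norms: since $W^* = W\Phi\Phi^\dagger$ is an orthogonal projection of the rows of $W$, its norm cannot exceed that of $W$.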
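Since $g^L = W \phi(h^{L-1})$, we have the conditional distribution of $g^L_k$ as follows
$$ g^L_k \mid \mathcal{B} \overset{\text{independent}}{\sim} \mathcal{N}\left( G_{k*} (\Phi^T \Phi)^\dagger \Phi^T \phi,\ \|\Pi^T \phi\|^2 / n \right), $$
where $G_{k*}$ denotes the $k$-th row of the matrix $G$ and $\phi = \phi^{L-1}$ for simplicity.
As a Lipschitz continuous activation is a controllable function, it follows from Theorem 9 and the inductive hypothesis that
$$
\begin{aligned}
\langle \phi^i, \phi^j \rangle / n
&= \frac{1}{n} \sum_{k=1}^n \phi(h^i_k)\, \phi(h^j_k)
= \frac{1}{n} \sum_{k=1}^n \phi(g^0_k + \beta g^1_k + \cdots + \beta g^i_k)\, \phi(g^0_k + \beta g^1_k + \cdots + \beta g^j_k) \\
&\xrightarrow{\text{a.s.}} \mathbb{E}\, \phi(z^0 + \beta z^1 + \cdots + \beta z^i)\, \phi(z^0 + \beta z^1 + \cdots + \beta z^j) =: \mathbb{E}\, \phi(u^i)\, \phi(u^j),
\end{aligned}
$$
where we define another Gaussian random variable $u^i$ to simplify the notation:
$$ u^i = z^0 + \beta z^1 + \cdots + \beta z^i. $$
Therefore, we have
$$ (\Phi^T \Phi)_{ij} / n = \langle \phi^i, \phi^j \rangle / n \xrightarrow{\text{a.s.}} \mathbb{E}\, \phi(u^i)\, \phi(u^j), \qquad (\Phi^T \phi)_i / n = \langle \phi^i, \phi \rangle / n \xrightarrow{\text{a.s.}} \mathbb{E}\, \phi(u^i)\, \phi(u^{L-1}). $$
For $\ell \in \{0, 1, \cdots, L-1\}$, let $U_\ell = \{u^0, \cdots, u^\ell\}$ be a collection of $u^i$. We define $\Sigma(U_\ell, U_k) \in \mathbb{R}^{(\ell+1) \times (k+1)}$ as
$$ \Sigma(U_\ell, U_k)_{ij} = \Sigma(u^i, u^j) = \mathbb{E}\, \phi(u^i)\, \phi(u^j), \quad \forall i \in \{0, 1, \cdots, \ell\},\ j \in \{0, 1, \cdots, k\}. $$
Therefore, we have
$$ (\Phi^T \Phi)^\dagger \Phi^T \phi = (\Phi^T \Phi / n)^\dagger\, \Phi^T \phi / n \to \Sigma(U_{L-2}, U_{L-2})^\dagger\, \Sigma(U_{L-2}, u^{L-1}). $$
Moreover, observe that
$$
\begin{aligned}
\|\Pi^T \phi\|^2 / n
&= \frac{1}{n} \phi^T (I_n - \Phi \Phi^\dagger) \phi
= \frac{1}{n} \phi^T \phi - \frac{1}{n} \phi^T \Phi (\Phi^T \Phi)^\dagger \Phi^T \phi \\
&= \phi^T \phi / n - (\phi^T \Phi / n)(\Phi^T \Phi / n)^\dagger (\Phi^T \phi / n) \\
&\to \Sigma(u^{L-1}, u^{L-1}) - \Sigma(u^{L-1}, U_{L-2})\, \Sigma(U_{L-2}, U_{L-2})^\dagger\, \Sigma(U_{L-2}, u^{L-1}).
\end{aligned}
$$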
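Therefore, for any controllable function $\psi$, it follows from Theorem 9 that
$$ \frac{1}{n} \sum_{k=1}^n \psi(g^0_k, g^1_k, \cdots, g^L_k) \to \mathbb{E}\left[ \psi(z^0, z^1, \cdots, z^L) \right], $$
where
$$ \operatorname{Cov}(z^0(x), z^\ell(\bar{x})) = 0, \quad \forall \ell \ge 1, \qquad \operatorname{Cov}(z^\ell(x), z^k(\bar{x})) = \mathbb{E}\left[ \phi(u^{\ell-1}(x))\, \phi(u^{k-1}(\bar{x})) \right], \quad \forall \ell, k \ge 1. $$
Let $\mathcal{B}_L$ be the smallest σ-algebra generated by $\{g^0, \cdots, g^L\}$. By conditioning on $\mathcal{B}_L$, we have
$$ f^L_\theta(x) \mid \mathcal{B}_L \sim \mathcal{N}(0, \|\phi^L\|^2 / n), \tag{40} $$
where
$$ \|\phi^L\|^2 / n = \frac{1}{n} \sum_{k=1}^n \phi(h^L_k)^2 = \frac{1}{n} \sum_{k=1}^n \phi\left( g^0_k + \beta \sum_{i=1}^L g^i_k \right)^2 \xrightarrow{\text{a.s.}} \mathbb{E}\, \phi\left( z^0 + \beta \sum_{i=1}^L z^i \right)^2 = \mathbb{E}\, [\phi(u^L)]^2 := \Sigma^{L+1}(x, x). $$
Thus, we obtain
$$ f^L_\theta \to \mathcal{GP}(0, \Sigma^{L+1}), \quad \text{where } \Sigma^{L+1}(x, \bar{x}) = \mathbb{E}\left[ \phi(u^L(x))\, \phi(u^L(\bar{x})) \right]. $$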

Section: D.2 NEURAL ODES AS GAUSSIAN PROCESSES
In this subsection, we prove that Neural ODEs tend to a Gaussian process as the width $n \to \infty$. As the output parameter $v$ is independent of all previous weights, by conditioning on the previous hidden layers, the Neural ODE output becomes a Gaussian random variable with covariance $\|\phi_T(x)\|^2 / n$, i.e.,
$$ f_\theta(x) \mid \mathcal{B} \sim \mathcal{N}(0, \|\phi_T(x)\|^2 / n), $$
where we denote $\phi_T(x) := \phi(h_T(x))$ to simplify the notation, $h_T$ is the exact solution of the forward ODE, and $\mathcal{B}$ is the smallest σ-algebra generated by the previous hidden layers. Here we also assume $\sigma_v = \sigma_w = \sigma_u = \sigma$, as their values are not important in the proof as long as they are strictly positive.
It follows from the convergence analysis of Euler's method, stated in Theorem 7, that
$$ \phi^L(x) \to \phi_T(x), \quad \text{as } L \to \infty, $$
where we denote $\phi^\ell(x) := \phi(h^\ell(x))$.
Thus, the analysis reduces to studying the convergence of the double sequence
$$ a_{n,L} := \langle \phi^L(x), \phi^L(\bar{x}) \rangle / n. $$
By leveraging the convergence result for Euler's method in Theorem 7, we can show that the double sequence $a_{n,L}$ converges as $L \to \infty$ and that this convergence is uniform in $n$ a.s.
Lemma 8. If $\phi$ is $L_1$-Lipschitz continuous, then the following inequalities hold for every $x \in \mathbb{S}^{d-1}$ a.s.:
$$ \|h_t\| \le C \sqrt{n}\, e^{C \sigma L_1 t}, \quad \forall t \in [0, T] \tag{41} $$
and
$$ \|h^\ell - h(t_\ell)\| \le \frac{A}{2B} \left( e^{B t_\ell} - 1 \right) \frac{T}{L} \sqrt{n}, \tag{42} $$
where $A := C \sigma^2 L_1^2 e^{C \sigma L_1 T}$ and $B := C \sigma L_1$ for some absolute constant $C > 0$.
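Proof. Recall from Lemma 5 that the mapping $f : x \mapsto \frac{\sigma}{\sqrt{n}} W \phi(x)$ is $\sigma L_1$-Lipschitz continuous. Observe that
$$ d(\dot{h}) = d\left[ \sigma W \phi(h(t)) / \sqrt{n} \right] = \frac{\sigma}{\sqrt{n}} W \operatorname{diag}[\phi'(h(t))]\, dh(t) = \frac{\sigma}{\sqrt{n}} W \operatorname{diag}[\phi'(h(t))]\, \dot{h}(t)\, dt = \frac{\sigma}{\sqrt{n}} W \operatorname{diag}[\phi'(h(t))] \frac{\sigma}{\sqrt{n}} W \phi(h(t))\, dt. $$
Then we have
$$ \ddot{h} = \frac{d}{dt} \dot{h} = \frac{\sigma}{\sqrt{n}} W \operatorname{diag}[\phi'(h(t))] \frac{\sigma}{\sqrt{n}} W \phi(h(t)) $$
and
$$ \|\ddot{h}\| \le C^2 \sigma^2 L_1^2 \|h(t)\|, $$
where we use the fact $\|W\| \le C \sqrt{n}$ a.s. from Theorem 4 for some absolute constant $C > 0$ and that $\phi$ is $L_1$-Lipschitz continuous.
Then $h(t) = h(0) + \int_0^t \dot{h}\, ds$ implies
$$ \|h(t)\| \le \|h(0)\| + \int_0^t \left\| \frac{\sigma}{\sqrt{n}} W \phi(h(s)) \right\| ds \le \|h(0)\| + \int_0^t C \sigma L_1 \|h(s)\|\, ds. $$
By Gronwall's inequality, we have
$$ \|h(t)\| \le \|h(0)\| \exp\left( \int_0^t C \sigma L_1\, ds \right) = \|h(0)\|\, e^{C \sigma L_1 t}. $$
Additionally, as $\|U\| \le C \sqrt{n}$ almost surely and $\|x\| = 1$, we have $\|h(0)\| \le C \sqrt{n}$, and so we obtain
$$ \|h(t)\| \le C \sqrt{n}\, e^{C \sigma L_1 t}, \quad \forall t \in [0, T]. $$
Therefore, we obtain
$$ \|\ddot{h}(t)\| \le C \sigma^2 L_1^2 \sqrt{n}\, e^{C \sigma L_1 t}, \quad \forall t \in [0, T]. $$
By the convergence theorem for Euler's method stated in Theorem 7, we have
$$ \|h^\ell - h(t_\ell)\| \le \frac{A}{2B} \left( e^{B t_\ell} - 1 \right) \frac{T}{L} \sqrt{n}, $$
where $A := C \sigma^2 L_1^2 e^{C \sigma L_1 T}$ and $B := C \sigma L_1$.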
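Lemma 9. Suppose the activation $\phi$ is $L_1$-Lipschitz continuous and $h_t(x)$ is the exact solution with input $x$. Given $L$, we have
$$ \left| \frac{1}{n} \langle \phi(h^k(x)), \phi(h^\ell(\bar{x})) \rangle - \frac{1}{n} \langle \phi(h_{t_k}(x)), \phi(h_{t_\ell}(\bar{x})) \rangle \right| \le C_1 L^{-1}, \quad \forall k, \ell \in [L], \tag{43} $$
where $t_k = k\beta$ and $C_1 > 0$ is some constant that does not depend on $n$ and $L$. Therefore, the double sequence $\langle \phi(h^k(x)), \phi(h^\ell(\bar{x})) \rangle / n$ converges w.r.t. $L$, uniformly w.r.t. $n$.
Proof. For simplicity, we assume the activation function is 1-Lipschitz continuous, i.e., $L_1 = 1$. For $\ell \le k \le L$, we denote $\phi^\ell = \phi(h^\ell(x))$, $\bar{\phi}^\ell = \phi(h^\ell(\bar{x}))$, $\phi(t) = \phi(h_t(x))$, and $\bar{\phi}(t) = \phi(h_t(\bar{x}))$, where $h_t(x)$ is the exact solution of the ordinary differential equation that encodes the input $x$. Then we consider
$$ \langle \phi^k, \bar{\phi}^\ell \rangle / n - \langle \phi(k\beta), \bar{\phi}(\ell\beta) \rangle / n = \frac{1}{n} \langle \phi^k, \bar{\phi}^\ell - \bar{\phi}(\ell\beta) \rangle + \frac{1}{n} \langle \phi^k - \phi(k\beta), \bar{\phi}(\ell\beta) \rangle, $$
where $\beta = T/L$ is the time step.
Note that
$$ \|h^{\ell+1}\| = \left\| h^\ell + \frac{T}{L} \frac{\sigma}{\sqrt{n}} W \phi(h^\ell) \right\| \le \|h^\ell\| + C \sigma \frac{T}{L} \|h^\ell\| = (1 + C \sigma T / L) \|h^\ell\|, $$
where we use the fact that $\phi$ is 1-Lipschitz continuous and $\|W\| \le C \sqrt{n}$ a.s. Repeating this argument $\ell$ times, we have
$$ \|h^{\ell+1}\| \le (1 + C \sigma T / L)^{\ell+1} \|h^0\|. $$
Therefore, we obtain
$$ \|\phi^\ell\| \le \|h^\ell(x)\| \le (1 + C \sigma T / L)^\ell \|h^0\| \le e^{C \sigma T \ell / L} \|h^0\| \le C \sqrt{n}\, e^{C \sigma T \ell / L}, $$
where we also use $\|U\| \le C \sqrt{n}$ a.s. and $\|x\| = 1$.
Moreover, we have
$$ \|\phi^\ell - \phi(\ell\beta)\| \le \|h^\ell - h(\ell\beta)\| \le C_1 \sqrt{n}\, L^{-1}, $$
where $C_1 > 0$ is a constant that does not depend on $n$ and $L$.
Therefore, we obtain
$$ \left| \langle \phi^k, \bar{\phi}^\ell \rangle / n - \langle \phi(k\beta), \bar{\phi}(\ell\beta) \rangle / n \right| \le \frac{1}{n} \cdot C_1 \sqrt{n} \cdot C_1 \sqrt{n}\, L^{-1} = C_1 L^{-1}. $$
Hence, $\langle \phi(h^\ell(x)), \phi(h^k(\bar{x})) \rangle / n$ converges w.r.t. $L$, uniformly in $n$.
Combining Lemma 9 with the Moore-Osgood theorem, stated in Theorem 8, the double sequence $a_{n,L} := \langle \phi(h^L(x)), \phi(h^L(\bar{x})) \rangle / n$ has both iterated limits, and they are equal to the double limit, i.e.,
$$
\begin{aligned}
\lim_{n \to \infty} \langle \phi(h_T(x)), \phi(h_T(\bar{x})) \rangle / n
&= \lim_{n \to \infty} \lim_{L \to \infty} \langle \phi(h^L(x)), \phi(h^L(\bar{x})) \rangle / n \\
&= \lim_{L \to \infty} \lim_{n \to \infty} \langle \phi(h^L(x)), \phi(h^L(\bar{x})) \rangle / n \\
&= \lim_{L \to \infty} \Sigma^{L+1}(x, \bar{x}) = \Sigma^*(x, \bar{x}).
\end{aligned}
$$
As $\Sigma^*$ is a deterministic function, the conditioned and unconditioned distributions of $f_\theta(x)$ are equal in the limit: they are centered Gaussian random variables with covariance $\Sigma^*(x, \bar{x})$. This completes the proof of Theorem 1.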

Section: E NTK FOR NEURAL ODE
In this section, we derive the neural tangent kernel (NTK) for Neural ODEs and provide sufficient conditions under which the NTK is well defined. In our analysis, the smoothness of the activation function plays a significant role in the study of the NTK of Neural ODEs. For example, additional smoothness is required to ensure the existence and uniqueness of the adjoint state $\lambda_t$ in the backward ODE equation 4 or the augmented backward ODE equation 36.

Section: E.1 CONVERGENCE ANALYSIS OF EULER'S METHOD FOR BACKWARD ODE
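Similar to the forward ODE, we can also discretize the backward ODE as follows:
$$ \tilde{\lambda}^{\ell+1} = \tilde{\lambda}^\ell - \beta \cdot \frac{\sigma_w}{\sqrt{n}} \operatorname{diag}[\phi'(h_{t_\ell})]\, W^T \tilde{\lambda}^\ell, \quad \forall \ell \in \{1, 2, \cdots, L\}, \tag{44} $$
where $\beta = T/L$, $h_t$ is the solution of the forward ODE equation 2, and $t_\ell := \beta\ell$. Additionally, we can further discretize $h_t$ and obtain
$$ \lambda^{\ell+1} = \lambda^\ell - \beta \cdot \frac{\sigma_w}{\sqrt{n}} \operatorname{diag}[\phi'(h^\ell)]\, W^T \lambda^\ell, \quad \forall \ell \in \{1, 2, \cdots, L\}. \tag{45} $$
As $L \to \infty$ or $\beta \to 0$,
$$ \|\lambda_t\| \le C \sigma L_1 e^{C \sigma L_1 (T - t)}, \quad \forall t \in [0, T] \tag{46} $$
and
$$ \|\lambda^\ell - \lambda_{t_\ell}\| \le \frac{T}{L} \frac{C_1}{C_2} \left( e^{C_2 (T - t_\ell)} - 1 \right), \tag{47} $$
where $C_1 = C L_1^2 L_2 \sigma^3 e^{C \sigma L_1 T}$ and $C_2 = C \sigma L_1 + C \sigma^2 L_1 L_2 e^{C \sigma L_1 T}$ for some absolute constant $C > 0$.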
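Proof. For the mapping $f : (\lambda, t) \mapsto -\frac{\sigma}{\sqrt{n}} \operatorname{diag}[\phi'(h_t)]\, W^T \lambda$, we consider
$$
\begin{aligned}
d\dot{\lambda}
&= d\left[ -\frac{\sigma}{\sqrt{n}} \operatorname{diag}[\phi'(h(t))]\, W^T \lambda \right]
= d\left[ -\phi'(h(t)) \odot \widetilde{W}^T \lambda \right]
= -[d\phi'(h_t)] \odot \widetilde{W}^T \lambda \\
&= -\phi''(h_t) \odot dh_t \odot \widetilde{W}^T \lambda
= -\phi''(h_t) \odot \widetilde{W}^T \lambda \odot \dot{h}\, dt
= -\phi''(h_t) \odot \widetilde{W}^T \lambda \odot \widetilde{W} \phi(h_t)\, dt \\
&= -\operatorname{diag}(\phi''(h_t)) \operatorname{diag}(\widetilde{W}^T \lambda)\, \widetilde{W} \phi(h_t)\, dt,
\end{aligned}
$$
where $\odot$ denotes the element-wise product and we denote $\widetilde{W} = \sigma W / \sqrt{n}$. Thus, we have
$$ \partial_t f(\lambda, t) = -\operatorname{diag}(\phi''(h_t)) \operatorname{diag}(\widetilde{W}^T \lambda)\, \widetilde{W} \phi(h_t). $$
Let $\tilde{w}_k$ be the $k$-th column of $\widetilde{W}$. As in this case we consider $\lambda$ as fixed, $\tilde{w}_k^T \lambda$ follows a Gaussian distribution with zero mean and variance $\sigma^2 \|\lambda\|^2 / n$. We obtain the inequality
$$ \|\partial_t f(\lambda, t)\| \le |\phi''| \cdot \frac{\sigma}{\sqrt{n}} \|\lambda\| \cdot \|\widetilde{W}\| \cdot \|\phi(h_t)\| \le C L_1 L_2 \sigma^2 \|\lambda\| \cdot \|h_t\| / \sqrt{n}, $$
where we use the assumption $|\phi''| \le L_2$ and $C > 0$ is some absolute constant.
Observe that
$$ \|\lambda_t\| \le \|\lambda_T\| + \int_t^T \|\dot{\lambda}_s\|\, ds \le C \sigma L_1 + \int_t^T C \sigma L_1 \|\lambda_s\|\, ds. $$
Then it follows from Gronwall's inequality that
$$ \|\lambda_t\| \le C \sigma L_1 \exp\left( \int_t^T C \sigma L_1\, ds \right) \le C \sigma L_1 e^{C \sigma L_1 (T - t)}. $$
Combining the above bound on $\lambda_t$ with equation 41, we have
$$ \|\partial_t f(\lambda, t)\| \le C L_1^2 L_2 \sigma^3 e^{C \sigma L_1 T} := C_1. $$
Note that $C_1 > 0$ is independent of $L$ and $n$.
With an argument similar to that of Theorem 7, we can obtain the global truncation error for $\lambda^\ell$. In Proposition 1 and Proposition 2, we have shown the existence and uniqueness of $h_t$ and $\lambda_t$ for all $t \in [0, T]$. Studying the convergence of $\lambda^\ell$ to $\lambda_{t_\ell}$ is equivalent to applying Euler's method to numerically solve for $\lambda_t$ in the reverse order, from $t = 0$ to $t = T$. Hence, we will assume $\lambda_0$ is known and provide the global truncation errors for $\lambda^\ell$.
Note that
$$
\begin{aligned}
\|\lambda^{\ell+1} - \lambda(t_{\ell+1})\|
&= \left\| \lambda^\ell - \beta \operatorname{diag}[\phi'(h^\ell)] \widetilde{W}^T \lambda^\ell - \left[ \lambda(t_\ell) + \beta \dot{\lambda}(t_\ell) + \frac{\beta^2}{2} \ddot{\lambda}(t_\ell) \right] \right\| \\
&\le \|\lambda^\ell - \lambda(t_\ell)\| + \beta \left\| \operatorname{diag}[\phi'(h^\ell)] \widetilde{W}^T \lambda^\ell - \operatorname{diag}[\phi'(h(t_\ell))] \widetilde{W}^T \lambda(t_\ell) \right\| + \frac{\beta^2}{2} C_1,
\end{aligned}
$$
where $\beta = T/L$ and we use $\|\ddot{\lambda}(t_\ell)\| = \|\partial_t f(t_\ell)\| \le C_1$. Additionally, the triangle inequality implies that
$$
\begin{aligned}
&\left\| \operatorname{diag}[\phi'(h^\ell)] \widetilde{W}^T \lambda^\ell - \operatorname{diag}[\phi'(h(t_\ell))] \widetilde{W}^T \lambda(t_\ell) \right\| \\
&\quad \le \left\| \operatorname{diag}[\phi'(h^\ell)] \widetilde{W}^T (\lambda^\ell - \lambda_{t_\ell}) \right\| + \left\| \left( \operatorname{diag}[\phi'(h^\ell)] - \operatorname{diag}[\phi'(h(t_\ell))] \right) \widetilde{W}^T \lambda(t_\ell) \right\| \\
&\quad \le L_1 \|\widetilde{W}\| \|\lambda^\ell - \lambda_{t_\ell}\| + L_2 \|h^\ell - h_{t_\ell}\| \|\widetilde{W}\| \|\lambda_{t_\ell}\| \\
&\quad \le C_2 \left( \|\lambda^\ell - \lambda_{t_\ell}\| + \|h^\ell - h_{t_\ell}\| \right),
\end{aligned}
$$
where the constant $C_2 = C \sigma L_1 + C \sigma^2 L_1 L_2 e^{C \sigma L_1 T}$. Hence, we have
$$ \|\lambda^{\ell+1} - \lambda_{t_{\ell+1}}\| \le \|\lambda^\ell - \lambda(t_\ell)\| + \beta C_2 \left( \|\lambda^\ell - \lambda_{t_\ell}\| + \|h^\ell - h_{t_\ell}\| \right) + \beta^2 C_1. $$
Denote $E_\ell = \|\lambda^\ell - \lambda_{t_\ell}\| + \|h^\ell - h_{t_\ell}\|$; then we have
$$ \|\lambda^\ell - \lambda_{t_\ell}\| \le E_\ell \le (1 + \beta C_2) E_{\ell-1} + \beta^2 C_1. $$
By induction, we have
$$ E_\ell \le (1 + \beta C_2)^\ell E_0 + \beta^2 C_1 \cdot \frac{(1 + \beta C_2)^\ell - 1}{(1 + \beta C_2) - 1}. $$
Since $E_0 = 0$ and $\beta = T/L$, we have
$$ E_\ell \le \frac{T}{L} \frac{C_1}{C_2} \left( e^{C_2 (T - t_\ell)} - 1 \right). $$
This completes the proof.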
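Additionally, since $\lambda_t = \partial f_\theta / \partial h_t$ is the solution to the backward ODE, we have
$$ \left\| \frac{\partial f_\theta}{\partial h_{t_\ell}} - \frac{\partial f^L_\theta}{\partial h^\ell} \right\| \le C_0 L^{-1}, \quad \forall \ell \in \{1, 2, \cdots, L\}, \tag{48} $$
where $C_0 > 0$ is some constant that does not depend on $n$ and $L$.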

Section: E.2 GRADIENT ALIGNMENTS
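By using Lemmas 8 and 10, we can show that the gradients obtained from optimize-then-discretize and discretize-then-optimize coincide as the depth $L \to \infty$. Observe that for any $x$, we have
$$
\begin{aligned}
\|\nabla_v f^L - \nabla_v f_\theta\|
&= \frac{\sigma}{\sqrt{n}} \|\phi(h^L) - \phi(h(T))\| \le \frac{\sigma}{\sqrt{n}} \cdot C L^{-1} \cdot \sqrt{n} \le C L^{-1}, \\
\|\nabla_W f^L - \nabla_W f_\theta\|
&= \left\| \int_0^T \frac{1}{\sqrt{n}} \frac{\partial f}{\partial h_t} \phi(h_t)\, dt - \sum_{\ell=1}^L \frac{T}{L} \frac{1}{\sqrt{n}} \frac{\partial f^L}{\partial h^\ell} \phi(h^{\ell-1}) \right\| \\
&\le \frac{1}{\sqrt{n}} \sum_{\ell=1}^L \int_{t_{\ell-1}}^{t_\ell} \left\| \frac{\partial f}{\partial h_t} \phi(h_t) - \frac{\partial f^L}{\partial h^\ell} \phi(h^{\ell-1}) \right\| dt \\
&\le \frac{1}{\sqrt{n}} \sum_{\ell=1}^L \int_{t_{\ell-1}}^{t_\ell} \left( \left\| \frac{\partial f}{\partial h_t} - \frac{\partial f^L}{\partial h^\ell} \right\| \|h_t\| + \left\| \frac{\partial f^L}{\partial h^\ell} \right\| \|h_t - h^{\ell-1}\| \right) dt \\
&\le \frac{C}{\sqrt{n}} \sum_{\ell=1}^L \int_{t_{\ell-1}}^{t_\ell} \sqrt{n}\, L^{-1}\, dt \le C \sum_{\ell=1}^L L^{-2} = C L^{-1}, \\
\|\nabla_U f^L - \nabla_U f_\theta\|
&\le \frac{\sigma}{\sqrt{d}} \|x\| \|\lambda^0 - \lambda(0)\| \le C L^{-1}.
\end{aligned}
$$
Hence, combining the three results proves Proposition 2.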

Section: E.3 NTK FOR FINITE-DEPTH NEURAL NETWORKS
For the Neural ODE defined in equation 1, its NTK is given by
$$K_\theta(x, \bar{x}) = \langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(\bar{x}) \rangle. \tag{49}$$
As we have shown in Proposition 1 and 2, $\nabla_\theta f_\theta(x)$ is well defined for every $x \in S^{d-1}$ (a.s.). Hence, $K_\theta(x, \bar{x})$ is well defined for every $x, \bar{x} \in S^{d-1}$. While $K_\theta$ is random and varies during training, as observed in Jacot et al. (2018), in the infinite-width limit it converges to an explicit deterministic kernel $K_\infty$, called the limiting NTK. Hence, we will show that $K_\infty$ is well defined and provide its explicit form.
Recall that we use a finite-depth neural network $f^L_\theta$ defined in equation 10 that approximates the Neural ODE $f_\theta$. As a result, we can also approximate the NTK $K_\theta$ using $K^L_\theta$, defined as follows:
$$K^L_\theta(x, \bar{x}) := \langle \nabla_\theta f^L_\theta(x), \nabla_\theta f^L_\theta(\bar{x}) \rangle. \tag{50}$$
We let $K^L_\infty$ denote the limit of $K^L_\theta$ as the width $n \to \infty$. In this subsection, we provide the explicit form of $K^L_\infty$. For the convergence analysis, we leverage the Master Theorem introduced in (Yang, 2020, Theorem 7.2). This result is similar to Theorem 9, but it also accounts for the backward information propagation; it is restated as follows.
Theorem 10. (Yang, 2020, Theorem 7.2) Consider a NETSOR⊤ program that has both forward and backward computation for a given finite-depth neural network. Suppose the network uses Gaussian random initialization and controllable activation functions. For any controllable $\psi : \mathbb{R}^M \to \mathbb{R}$, as the width $n \to \infty$, any finite collection of G-vars $g_\alpha$ of size $M$ satisfies
$$\frac{1}{n} \sum_{\alpha=1}^{n} \psi(g^0_\alpha, \dots, g^M_\alpha) \xrightarrow{\text{a.s.}} \mathbb{E}\, \psi(z^0, \dots, z^M), \tag{51}$$
where $\{z^0, \dots, z^M\}$ are Gaussian random variables whose mean and covariance are computed by the corresponding NETSOR⊤ program.
Published as a conference paper at ICLR 2025
Algorithm 2 ResNet $f^L_\theta$ Forward and Backward Computation on Input $x$
Input: $U x / \sqrt{d} : G(n)$
Input: $W : A(n, n)$
Input: $v : G(n)$
1: $h^0 := U x / \sqrt{d} : G(n)$
2: for $\ell \in \{1, 2, \dots, L\}$ do
3:   $x^\ell = \phi(h^{\ell-1}) : H(n)$
4:   $g^\ell := W x^\ell / \sqrt{n} : G(n)$
5:   $h^\ell := h^{\ell-1} + \kappa \cdot g^\ell : G(n)$
6: end for
7: $x^L = \phi(h^L) : H(n)$
8: $dx^L = v / \sqrt{n} : G(n)$
9: $dh^L = dx^L \odot \phi'(h^L) : H(n)$
10: for $\ell \in \{L, L-1, \dots, 1\}$ do
11:   $dg^\ell = \kappa \cdot dh^\ell : H(n)$
12:   $dx^\ell = W^\top dg^\ell / \sqrt{n} : G(n)$
13:   $dh^{\ell-1} = dh^\ell + \phi'(h^{\ell-1}) \odot dx^\ell : H(n)$
14: end for
Output: $\|x^L\|^2 / n + \sum_{\ell=1}^{L} \langle dg^\ell x^{\ell\top}, dg^\ell x^{\ell\top} \rangle / n + \langle dh^0 x^\top, dh^0 x^\top \rangle / d$
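The forward and backward recursions of Algorithm 2, and the empirical kernel $K^L_\theta$ of equation 50, can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the activation is taken to be tanh, all variance parameters are set to one, and the names `forward_backward` and `empirical_ntk` are hypothetical. A central finite difference on one entry of $W$ is used as a sanity check on the backward pass.

```python
import numpy as np

def forward_backward(U, W, v, x, depth, kappa):
    """Forward and backward pass of the finite-depth network, following Algorithm 2.

    Assumes phi = tanh and sigma_u = sigma_w = sigma_v = 1 (illustrative choices).
    Returns the scalar output and the gradients with respect to v, W, and U.
    """
    n, d = U.shape
    phi = np.tanh
    dphi = lambda h: 1.0 - np.tanh(h) ** 2

    # Forward pass: h^0 = U x / sqrt(d), h^l = h^{l-1} + kappa * W phi(h^{l-1}) / sqrt(n).
    hs = [U @ x / np.sqrt(d)]
    for _ in range(depth):
        hs.append(hs[-1] + kappa * (W @ phi(hs[-1])) / np.sqrt(n))
    out = v @ phi(hs[-1]) / np.sqrt(n)

    # Backward pass: dh^L = v * phi'(h^L) / sqrt(n), then the recursion of Algorithm 2.
    grad_v = phi(hs[-1]) / np.sqrt(n)
    dh = v * dphi(hs[-1]) / np.sqrt(n)
    grad_W = np.zeros_like(W)
    for l in range(depth, 0, -1):
        dg = kappa * dh                                   # dg^l = kappa * dh^l
        grad_W += np.outer(dg, phi(hs[l - 1])) / np.sqrt(n)
        dx = W.T @ dg / np.sqrt(n)                        # dx^l = W^T dg^l / sqrt(n)
        dh = dh + dphi(hs[l - 1]) * dx                    # dh^{l-1}
    grad_U = np.outer(dh, x) / np.sqrt(d)
    return out, grad_v, grad_W, grad_U

def empirical_ntk(params, x, x_bar, depth, kappa):
    """Sum of the three gradient inner products that make up K^L_theta(x, x_bar)."""
    U, W, v = params
    _, gv, gW, gU = forward_backward(U, W, v, x, depth, kappa)
    _, gv_b, gW_b, gU_b = forward_backward(U, W, v, x_bar, depth, kappa)
    return float(gv @ gv_b + np.sum(gW * gW_b) + np.sum(gU * gU_b))

rng = np.random.default_rng(0)
n, d, depth = 32, 4, 8
kappa = 1.0 / depth
U, W, v = rng.standard_normal((n, d)), rng.standard_normal((n, n)), rng.standard_normal(n)
x = rng.standard_normal(d); x /= np.linalg.norm(x)
x_bar = rng.standard_normal(d); x_bar /= np.linalg.norm(x_bar)

# Central finite-difference check of one entry of the W-gradient.
out, _, grad_W, _ = forward_backward(U, W, v, x, depth, kappa)
eps = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[2, 3] += eps
W_minus[2, 3] -= eps
fd = (forward_backward(U, W_plus, v, x, depth, kappa)[0]
      - forward_backward(U, W_minus, v, x, depth, kappa)[0]) / (2 * eps)
fd_gap = abs(fd - grad_W[2, 3])
print("finite-difference gap:", fd_gap)
print("K(x, x_bar):", empirical_ntk((U, W, v), x, x_bar, depth, kappa))
```

The kernel value is random at finite width; the analysis in this subsection concerns its almost-sure limit as $n \to \infty$.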
As a result, this type of Tensor program is called NETSOR⊤, and it includes additional G-vars from the backward information propagation. In our setup, to compute the gradients of $f^L_\theta$ defined in equation 10, the following new G-vars are introduced:
$$dg^{L+1} := \frac{\sigma_v}{\sqrt{n}} \operatorname{diag}[\phi'(h^L)]\, v, \qquad dg^{\ell} := \frac{\sigma_w}{\sqrt{n}} \operatorname{diag}[\phi'(h^{\ell-1})]\, W^\top, \quad \forall \ell \in \{1, 2, \dots, L\},$$
and the associated NETSOR⊤ program is given in Algorithm 2. In the rest of this subsection, we provide a rigorous proof of the convergence of $K^L_\theta$ to $K^L_\infty$, as stated in Proposition 4. Without loss of generality, we assume $\sigma_v = \sigma_w = 1$ and $\sigma_u / \sqrt{d} = 1$. As $\theta = \operatorname{vec}(v, W, U)$, we have
$$K^L_\theta(x, \bar{x}) = \langle \nabla_v f^L_\theta(x), \nabla_v f^L_\theta(\bar{x}) \rangle + \langle \nabla_W f^L_\theta(x), \nabla_W f^L_\theta(\bar{x}) \rangle + \langle \nabla_U f^L_\theta(x), \nabla_U f^L_\theta(\bar{x}) \rangle.$$
Hence, we will show the convergence of each term. To simplify the notation, we abbreviate $f := f^L_\theta(x)$ and $\bar{f} := f^L_\theta(\bar{x})$.
CONVERGENCE OF $\langle \nabla_v f, \nabla_v \bar{f} \rangle$
By using simple calculus, we have
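$$\nabla_v f = \phi(h^L) / \sqrt{n}, \tag{52}$$
$$\nabla_{h^L} f = v \odot \phi'(h^L) / \sqrt{n}. \tag{53}$$
By Theorem 10, we have
$$\langle \nabla_v f, \nabla_v \bar{f} \rangle = \frac{1}{n} \phi(h^L)^\top \phi(\bar{h}^L) \xrightarrow{\text{a.s.}} \mathbb{E}\, \phi(u^L) \phi(\bar{u}^L) = C_{L+1,L+1}(x, \bar{x}),$$
where $u^\ell = z^0 + \kappa \sum_{i=1}^{\ell} z^i$ is a centered Gaussian random variable and the $z^i$ are the centered Gaussian random variables defined in Proposition 3; the convergence result follows from Theorem 9.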
CONVERGENCE OF ∇ W f, ∇ W f
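To show the convergence, we first rewrite the forward propagation suggested by the Tensor program: for all $\ell \in \{1, 2, \dots, L\}$,
$$g^\ell = \frac{1}{\sqrt{n}} W x^{\ell-1}, \qquad h^\ell = h^{\ell-1} + \kappa g^\ell, \qquad x^\ell = \phi(h^\ell).$$
By using the chain rule, we obtain
$$\nabla_W f = \frac{1}{\sqrt{n}} \sum_{\ell=1}^{L} (\nabla_{g^\ell} f) \cdot (x^{\ell-1})^\top. \tag{54}$$
Then, the quantity can be written as follows:
$$\langle \nabla_W f, \nabla_W \bar{f} \rangle = \sum_{\ell,k=1}^{L} \langle dg^\ell, d\bar{g}^k \rangle \cdot \langle x^{\ell-1}, \bar{x}^{k-1} \rangle / n,$$
where $dz$ denotes the gradient of $f$ with respect to a vector $z$ that occurs in the forward propagation.
It follows from Theorem 9 that
$$\frac{1}{n} \langle x^{\ell-1}, \bar{x}^{k-1} \rangle = \frac{1}{n} \langle \phi(h^{\ell-1}), \phi(\bar{h}^{k-1}) \rangle \xrightarrow{\text{a.s.}} C_{\ell,k}(x, \bar{x}). \tag{55}$$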
Moreover, we have
$$dx^{\ell-1} = \frac{1}{\sqrt{n}} W^\top dg^\ell \quad \text{and} \quad dg^\ell = \kappa\, dh^\ell = \kappa \left( dh^{\ell+1} + dx^\ell \odot \phi'(h^\ell) \right) = dg^{\ell+1} + \kappa\, dx^\ell \odot \phi'(h^\ell).$$
Repeating this recursive relation, we obtain
$$dg^\ell = \kappa \sum_{i=\ell}^{L} dx^i \odot \phi'(h^i). \tag{56}$$
By Theorem 10 (Yang, 2020), the coordinates of $dx^{\ell-1}$ are asymptotically i.i.d. centered Gaussian random variables, which satisfy:
E[Z dx ℓ-1 Z dx k-1 ] =κ 2 E   ℓ,k i,j=L Z dx i Zdx j ϕ ′ (u i )ϕ ′ (ū j )   =κ 2 ℓ,k i,j=L E Z dx i Zdx j E[ϕ ′ (u i )ϕ ′ (ū j )]
Hence, we obtain
D ℓ,k (x, x) = κ 2 ℓ+1,k+1 i,j=L D i,j (x, x)E[ϕ ′ (u i )ϕ ′ (ū j )](57)
As a result, we have
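$$\langle \nabla_W f, \nabla_W \bar{f} \rangle \xrightarrow{\text{a.s.}} \kappa^2 \sum_{\ell,k=1}^{L} C_{\ell,k}(x, \bar{x}) D_{\ell,k}(x, \bar{x}). \tag{58}$$
CONVERGENCE OF $\langle \nabla_U f, \nabla_U \bar{f} \rangle$
As $h^0 = U x$, $h^0_i = \sum_{j=1}^{d} U_{ij} x_j$ implies $\partial h^0_k / \partial U_{ij} = \delta_{k,i} x_j$. Observe that
$$
\begin{aligned}
\langle \nabla_U f, \nabla_U \bar{f} \rangle &= \sum_{i,j} \frac{\partial f}{\partial U_{ij}} \frac{\partial \bar{f}}{\partial U_{ij}} = \sum_{i,j} \left( \sum_{\alpha} \frac{\partial h^0_\alpha}{\partial U_{ij}} \frac{\partial f}{\partial h^0_\alpha} \right) \left( \sum_{\beta} \frac{\partial \bar{h}^0_\beta}{\partial U_{ij}} \frac{\partial \bar{f}}{\partial \bar{h}^0_\beta} \right) \\
&= \sum_{\alpha,\beta} \frac{\partial f}{\partial h^0_\alpha} \frac{\partial \bar{f}}{\partial \bar{h}^0_\beta} \sum_{i,j} \frac{\partial h^0_\alpha}{\partial U_{ij}} \frac{\partial \bar{h}^0_\beta}{\partial U_{ij}} = \sum_{\alpha,\beta} \frac{\partial f}{\partial h^0_\alpha} \frac{\partial \bar{f}}{\partial \bar{h}^0_\beta} \sum_{i,j} \delta_{\alpha,i} x_j\, \delta_{\beta,i} \bar{x}_j \\
&= \sum_{\alpha,\beta} \frac{\partial f}{\partial h^0_\alpha} \frac{\partial \bar{f}}{\partial \bar{h}^0_\beta} \cdot \delta_{\alpha,\beta}\, x^\top \bar{x} = \sum_{\alpha} \frac{\partial f}{\partial h^0_\alpha} \frac{\partial \bar{f}}{\partial \bar{h}^0_\alpha} \cdot x^\top \bar{x} \xrightarrow{\text{a.s.}} D_{0,0}(x, \bar{x})\, C_{0,0}(x, \bar{x}),
\end{aligned}
$$
where $C_{0,0}(x, \bar{x}) = x^\top \bar{x}$.
Putting everything together yields
$$\langle \nabla_\theta f, \nabla_\theta \bar{f} \rangle = \langle \nabla_v f, \nabla_v \bar{f} \rangle + \langle \nabla_W f, \nabla_W \bar{f} \rangle + \langle \nabla_U f, \nabla_U \bar{f} \rangle \xrightarrow{\text{a.s.}} C_{L+1,L+1}(x, \bar{x}) + \sum_{\ell,k=1}^{L} C_{\ell,k}(x, \bar{x}) D_{\ell,k}(x, \bar{x}) + C_{0,0}(x, \bar{x}) D_{0,0}(x, \bar{x}).$$
Hence, $K^L_\theta(x, \bar{x})$ converges a.s. to $K^L_\infty(x, \bar{x})$, defined as follows:
$$K^L_\infty(x, \bar{x}) = C_{L+1,L+1}(x, \bar{x}) + \sum_{\ell,k=1}^{L} C_{\ell,k}(x, \bar{x}) D_{\ell,k}(x, \bar{x}) + C_{0,0}(x, \bar{x}) D_{0,0}(x, \bar{x}).$$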

Section: E.4 NTK FOR NEURAL ODES
In the previous subsection, we have shown the NTK K L θ converges to a deterministic limiting NTK K L ∞ as the width n → ∞. In this subsection, in the same limit, we will show the NTK K θ of Neural ODE f θ defined in equation 1 converges to the limiting NTK K ∞ .
Similar to the NNGP kernel Σ * , the NTK K ∞ can be considered as the limit of a double sequence:
$$K_\infty(x, \bar{x}) = \lim_{n \to \infty} \langle \nabla_\theta f_\theta, \nabla_\theta \bar{f}_\theta \rangle = \lim_{n \to \infty} \lim_{L \to \infty} \langle \nabla_\theta f^L_\theta, \nabla_\theta \bar{f}^L_\theta \rangle.$$
We have shown $\lim_{n \to \infty} \langle \nabla_\theta f^L_\theta, \nabla_\theta \bar{f}^L_\theta \rangle = K^L_\infty(x, \bar{x})$ in the previous subsection. Hence, establishing the convergence to $K_\infty$ amounts to showing that the two limits, in depth and in width, are interchangeable. Fortunately, if the activation function $\phi$ is sufficiently smooth, the two limits can indeed be swapped, and so the NTK $K_\infty$ is well defined.
Based on the Moore-Osgood theorem stated in Theorem 8, a double sequence has well-defined iterated limits that are equal to the double limit if the double sequence converges pointwise in one index and uniformly in the other. Hence, we will show that the NTK $K^L_\theta$, viewed as a double sequence, converges in the depth $L$ uniformly with respect to the width $n$.
Proof. Without loss of generality, we will assume σ v = σ w = 1 and σ u / √ d = 1. Observe that
K θ (x, x) = ⟨∇ v f θ (x), ∇ v f θ (x)⟩ + ⟨∇ W f θ (x), ∇ W f θ (x)⟩ + ⟨∇ U f θ (x), ∇ U f θ (x)⟩ .
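Hence, the rest of the proof establishes the convergence rate for each term in the summation.
Note that
$$
\begin{aligned}
&\left| \langle \nabla_v f^L(x), \nabla_v f^L(\bar{x}) \rangle - \langle \nabla_v f_\theta(x), \nabla_v f_\theta(\bar{x}) \rangle \right| \\
&\quad = \left| \frac{1}{n} \langle \phi(h^L(x)), \phi(h^L(\bar{x})) \rangle - \frac{1}{n} \langle \phi(h(x, T)), \phi(h(\bar{x}, T)) \rangle \right| \\
&\quad = \left| \frac{1}{n} \langle \phi(h^L(x)), \phi(h^L(\bar{x})) - \phi(h(\bar{x}, T)) \rangle + \frac{1}{n} \langle \phi(h^L(x)) - \phi(h(x, T)), \phi(h(\bar{x}, T)) \rangle \right| \\
&\quad \le \frac{L_1^2}{n} \|h^L(x)\| \|h^L(\bar{x}) - h(\bar{x}, T)\| + \frac{L_1^2}{n} \|h^L(x) - h(x, T)\| \|h(\bar{x}, T)\| \\
&\quad \le \frac{1}{n} C \sqrt{n} \cdot \sqrt{n} L^{-1} = C L^{-1},
\end{aligned}
$$
where we use the Lipschitz continuity of $\phi$ and Lemma 8.
Next, we show that $\|\nabla_W f\|$ and $\|\nabla_W f^L\|$ are upper bounded by constants as long as $T < \infty$. Observe that
$$\|\nabla_W f(x)\| = \left\| \int_0^T \frac{1}{\sqrt{n}} \lambda_t \phi(h_t)\, dt \right\| \le \int_0^T \frac{1}{\sqrt{n}} \cdot e^{C\sigma(T - t)} \cdot \sqrt{n}\, e^{C\sigma t}\, dt \le C \sigma T e^{C\sigma T},$$
where we use Lemma 8 and 10.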
Similarly, we have
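$$\|\nabla_W f^L(x)\| = \left\| \sum_{\ell=1}^{L} \frac{T}{L} \frac{1}{\sqrt{n}} \frac{\partial f^L}{\partial h^\ell} \phi(h^{\ell-1}) \right\| \le \frac{T}{L} \sum_{\ell=1}^{L} \frac{1}{\sqrt{n}} \left\| \frac{\partial f^L}{\partial h^\ell} \right\| \|h^{\ell-1}\| \le \frac{T}{L} \sum_{\ell=1}^{L} \frac{1}{\sqrt{n}} \cdot (1 + \sigma T / L)^{L-\ell} \cdot (1 + \sigma T / L)^{\ell-1} \cdot C \sigma \sqrt{n} \le C \sigma T e^{\sigma T},$$
where we use the facts
$$\|h^\ell\| \le (1 + \sigma T / L)^{\ell} \|h^0\|, \tag{59}$$
$$\left\| \frac{\partial f^L}{\partial h^\ell} \right\| \le (1 + \sigma T / L)^{L-\ell} \left\| \partial f^L / \partial h^L \right\|, \tag{60}$$
for all $\ell \in \{0, 1, \dots, L\}$.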
Additionally, it follows from Lemma 8 and 10 that
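$$
\begin{aligned}
\|\nabla_W f^L(x) - \nabla_W f_\theta(x)\| &= \left\| \int_0^T \frac{1}{\sqrt{n}} \frac{\partial f}{\partial h_t} \phi(h_t)\, dt - \sum_{\ell=1}^{L} \frac{T}{L} \frac{1}{\sqrt{n}} \frac{\partial f^L}{\partial h^{\ell}} \phi(h^{\ell-1}) \right\| \\
&\le \frac{1}{\sqrt{n}} \sum_{\ell=1}^{L} \int_{t_{\ell-1}}^{t_\ell} \left\| \frac{\partial f}{\partial h_t} \phi(h_t) - \frac{\partial f^L}{\partial h^{\ell}} \phi(h^{\ell-1}) \right\| dt \\
&\le \frac{1}{\sqrt{n}} \sum_{\ell=1}^{L} \int_{t_{\ell-1}}^{t_\ell} \left( \left\| \frac{\partial f}{\partial h_t} - \frac{\partial f^L}{\partial h^{\ell}} \right\| \|h_t\| + \left\| \frac{\partial f^L}{\partial h^{\ell}} \right\| \|h_t - h^{\ell-1}\| \right) dt \\
&\le \frac{C}{\sqrt{n}} \sum_{\ell=1}^{L} \int_{t_{\ell-1}}^{t_\ell} \sqrt{n} L^{-1}\, dt \le C \sum_{\ell=1}^{L} L^{-2} = C L^{-1}.
\end{aligned}
$$
Hence, we obtain
$$
\begin{aligned}
&\left| \langle \nabla_W f^L(x), \nabla_W f^L(\bar{x}) \rangle - \langle \nabla_W f_\theta(x), \nabla_W f_\theta(\bar{x}) \rangle \right| \\
&\quad \le \left| \langle \nabla_W f^L(x), \nabla_W f^L(\bar{x}) - \nabla_W f_\theta(\bar{x}) \rangle \right| + \left| \langle \nabla_W f^L(x) - \nabla_W f_\theta(x), \nabla_W f_\theta(\bar{x}) \rangle \right| \\
&\quad \le \|\nabla_W f^L(x)\| \cdot \|\nabla_W f^L(\bar{x}) - \nabla_W f_\theta(\bar{x})\| + \|\nabla_W f^L(x) - \nabla_W f_\theta(x)\| \|\nabla_W f_\theta(\bar{x})\| \le C L^{-1},
\end{aligned}
$$
that is,
$$\left| \langle \nabla_W f^L(x), \nabla_W f^L(\bar{x}) \rangle - \langle \nabla_W f_\theta(x), \nabla_W f_\theta(\bar{x}) \rangle \right| \le C L^{-1}. \tag{61}$$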
Next, observe that
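$$\langle \nabla_U f^L(x), \nabla_U f^L(\bar{x}) \rangle - \langle \nabla_U f_\theta(x), \nabla_U f_\theta(\bar{x}) \rangle = \langle x, \bar{x} \rangle \left\langle \frac{\partial f^L(x)}{\partial h^0(x)}, \frac{\partial f^L(\bar{x})}{\partial h^0(\bar{x})} \right\rangle - \langle x, \bar{x} \rangle \left\langle \frac{\partial f_\theta(x)}{\partial h(x, 0)}, \frac{\partial f_\theta(\bar{x})}{\partial h(\bar{x}, 0)} \right\rangle.$$
Then we have
$$
\begin{aligned}
&\left| \left\langle \frac{\partial f^L(x)}{\partial h^0(x)}, \frac{\partial f^L(\bar{x})}{\partial h^0(\bar{x})} \right\rangle - \left\langle \frac{\partial f_\theta(x)}{\partial h(x, 0)}, \frac{\partial f_\theta(\bar{x})}{\partial h(\bar{x}, 0)} \right\rangle \right| \\
&\quad \le \left| \left\langle \frac{\partial f^L(x)}{\partial h^0(x)}, \frac{\partial f^L(\bar{x})}{\partial h^0(\bar{x})} - \frac{\partial f_\theta(\bar{x})}{\partial h(\bar{x}, 0)} \right\rangle \right| + \left| \left\langle \frac{\partial f^L(x)}{\partial h^0(x)} - \frac{\partial f_\theta(x)}{\partial h(x, 0)}, \frac{\partial f_\theta(\bar{x})}{\partial h(\bar{x}, 0)} \right\rangle \right| \\
&\quad \le \left\| \frac{\partial f^L(x)}{\partial h^0(x)} \right\| \cdot \left\| \frac{\partial f^L(\bar{x})}{\partial h^0(\bar{x})} - \frac{\partial f_\theta(\bar{x})}{\partial h(\bar{x}, 0)} \right\| + \left\| \frac{\partial f^L(x)}{\partial h^0(x)} - \frac{\partial f_\theta(x)}{\partial h(x, 0)} \right\| \cdot \left\| \frac{\partial f_\theta(\bar{x})}{\partial h(\bar{x}, 0)} \right\| \le C L^{-1},
\end{aligned}
$$
where we use the Lipschitz smoothness of $\phi'$ and Lemma 10. Therefore, we have
$$\left| \langle \nabla_U f^L(x), \nabla_U f^L(\bar{x}) \rangle - \langle \nabla_U f_\theta(x), \nabla_U f_\theta(\bar{x}) \rangle \right| \le C L^{-1}.$$
Then putting everything together yields
$$\left| \langle \nabla_\theta f^L(x), \nabla_\theta f^L(\bar{x}) \rangle - \langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(\bar{x}) \rangle \right| \le C L^{-1}. \tag{62}$$
Therefore, since the constant $C$ does not depend on $n$, $K^L_\theta$ converges to $K_\theta$ as $L \to \infty$ uniformly in the width $n$.
Combining Lemma 2 with Proposition 4 and the Moore-Osgood theorem (Theorem 8), we can switch $L$ and $n$ in the double sequence $K_\theta(x, \bar{x})$ and obtain the desired result
$$K_\infty(x, \bar{x}) = \lim_{n \to \infty} K_\theta(x, \bar{x}) = \lim_{n \to \infty} \langle \nabla_\theta f_\theta, \nabla_\theta \bar{f}_\theta \rangle = \lim_{n \to \infty} \lim_{L \to \infty} \langle \nabla_\theta f^L_\theta, \nabla_\theta \bar{f}^L_\theta \rangle = \lim_{L \to \infty} \lim_{n \to \infty} \langle \nabla_\theta f^L_\theta, \nabla_\theta \bar{f}^L_\theta \rangle = \lim_{L \to \infty} K^L_\infty(x, \bar{x}).$$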

Section: E.5 INTEGRAL FORM OF NNGP AND NTK
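In this subsection, we provide the explicit form of the NNGP and NTK of Neural ODEs as the limits of $\Sigma^L$ and $K^L_\infty$. It follows from Proposition 5 and Lemma 8 that
$$\Sigma_{0,t}(x, \bar{x}) = \delta_{0,t} \frac{\sigma_u^2}{d} x^\top \bar{x}, \quad \forall t \in [0, T], \tag{63}$$
$$\Sigma_{t,s}(x, \bar{x}) = \sigma_w^2\, \mathbb{E}\, \phi(u_t) \phi(\bar{u}_s), \quad \forall t, s \in [0, T], \tag{64}$$
where $(u_t, \bar{u}_s)$ are centered Gaussian random variables with covariance
$$\mathbb{E}[u_t \bar{u}_s] = \Sigma_{0,0} + \int_0^t \int_0^s \Sigma_{t',s'}(x, \bar{x})\, dt'\, ds'. \tag{65}$$
Hence, the NNGP kernel of the Neural ODE is given by
$$\Sigma^*(x, \bar{x}) = \Sigma_{T,T}(x, \bar{x}) = \sigma_v^2\, \mathbb{E}\, \phi(u_T) \phi(\bar{u}_T). \tag{66}$$
For the NTK of Neural ODEs, we have
$$K_{t,s}(x, \bar{x}) = \int_t^T \int_s^T K_{t',s'}(x, \bar{x})\, \dot{\Sigma}_{t',s'}(x, \bar{x})\, dt'\, ds', \tag{67}$$
where $\dot{\Sigma}_{t',s'}(x, \bar{x}) := \mathbb{E}\, \phi'(u_{t'}) \phi'(\bar{u}_{s'})$. As a result, the NTK of the Neural ODE is given by
$$K_\infty(x, \bar{x}) = \Sigma^*(x, \bar{x}) + \int_0^T \int_0^T \Sigma_{t,s}(x, \bar{x}) K_{t,s}(x, \bar{x})\, dt\, ds + \Sigma_{0,0}(x, \bar{x}) K_{0,0}(x, \bar{x}). \tag{68}$$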

Section: F STRICT POSITIVE DEFINITENESS OF NEURAL ODE'S NTK
In this section, we prove that the NTK $K_\infty$ of Neural ODEs is strictly positive definite. We first recall the definition of strict positive definiteness for a kernel function.
Definition 2. A kernel function $k : X \times X \to \mathbb{R}$ is strictly positive definite (SPD) if, for any finite set of distinct points $x_1, \dots, x_N \in X$, the symmetric matrix $K = [k(x_i, x_j)]_{i,j=1}^{N}$ is strictly positive definite, i.e., $c^\top K c > 0$ for every nonzero vector $c \in \mathbb{R}^N$.
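Definition 2 can be tested numerically on a given point set by inspecting the smallest eigenvalue of the Gram matrix. The sketch below is only an illustration: the helper names, the tolerance, and the two example kernels (a linear kernel and a Gaussian kernel with an arbitrary bandwidth) are assumptions and do not come from the paper. A finite check of this kind can refute strict positive definiteness on a point set but cannot prove it for all point sets.

```python
import numpy as np

def gram_matrix(kernel, points):
    """Build K = [k(x_i, x_j)] for a list of points."""
    return np.array([[kernel(p, q) for q in points] for p in points])

def is_strictly_positive_definite(gram, tol=1e-10):
    """Check c^T K c > 0 for all nonzero c via the smallest eigenvalue of the symmetrized Gram matrix."""
    sym = 0.5 * (gram + gram.T)
    return bool(np.linalg.eigvalsh(sym).min() > tol)

rng = np.random.default_rng(0)
points = rng.standard_normal((6, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)  # six distinct points on S^2

linear = lambda p, q: float(p @ q)                             # rank at most 3, so not SPD on 6 points
gaussian = lambda p, q: float(np.exp(-2.0 * np.sum((p - q) ** 2)))

linear_spd = is_strictly_positive_definite(gram_matrix(linear, points))
gaussian_spd = is_strictly_positive_definite(gram_matrix(gaussian, points))
print("linear kernel SPD on this set:  ", linear_spd)
print("Gaussian kernel SPD on this set:", gaussian_spd)
```

The linear kernel fails because six points in $\mathbb{R}^3$ give a Gram matrix of rank at most three, which mirrors the role nonlinearity plays in the SPD arguments of this section.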

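Recall that
$$K_\theta(x, \bar{x}) = \langle \nabla_v f_\theta(x), \nabla_v f_\theta(\bar{x}) \rangle + \langle \nabla_W f_\theta(x), \nabla_W f_\theta(\bar{x}) \rangle + \langle \nabla_U f_\theta(x), \nabla_U f_\theta(\bar{x}) \rangle.$$
In Theorem 2, we have shown that $K_\theta(x, \bar{x}) \to K_\infty(x, \bar{x})$ as $n \to \infty$, provided $\phi$ is sufficiently smooth, and $\langle \nabla_v f_\theta(x), \nabla_v f_\theta(\bar{x}) \rangle \to \Sigma^*(x, \bar{x})$. Since the remaining two terms are positive semidefinite kernels, to show that $K_\infty$ is SPD it suffices to show that $\Sigma^*$ is SPD.
Moreover, it follows from Theorem 1 that $\lim_{L \to \infty} \Sigma^L(x, \bar{x}) = \Sigma^*(x, \bar{x})$. We first show $\Sigma^L$ is SPD.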

Section: F.1 DUAL ACTIVATION AND SPD OF FINITE-DEPTH NETWORK'S NNGP KERNEL
We first provide the result for the finite-depth network $f^L_\theta$ defined by equation 10, where the depth $L < \infty$.
Proposition 6. Suppose $\phi$ is $L_1$-Lipschitz continuous. If $\phi$ is nonlinear and non-polynomial, then $\Sigma^L$ is SPD on $S^{d-1}$ for $1 \le L < \infty$.
The proof is based on the concepts of dual activation and Hermite expansion, which we briefly introduce here. For details, we refer readers to the appendices of Gao et al. (2021) and Daniely et al. (2016).
Let x ∼ N (0, 1) and f : R → R be a real-valued function. We can define an inner product using expectation:
⟨f, g⟩ := E x∼N (0,1) f (x)g(x).
Thus, we can further define a Hilbert space of functions H, that is, f ∈ H if and only if
∥f ∥ 2 = ⟨f, f ⟩ = E x∼N (0,1) |f (x)| 2 < ∞.
Applying the Gram-Schmidt process to the polynomial functions $\{1, x, x^2, \dots\}$ with respect to the inner product defined above, we obtain the (normalized) Hermite polynomials $\{h_n\}$, which form an orthonormal basis of the Hilbert space $H$:
$$h_n(x) = \frac{(-1)^n}{\sqrt{n!}}\, e^{x^2/2} \frac{d^n}{dx^n} e^{-x^2/2}.$$
The dual activation $\hat{\phi} : [-1, 1] \to \mathbb{R}$ of an activation function $\phi$ is defined by
$$\hat{\phi}(\rho) := \mathbb{E}_{(u,v) \sim N_\rho}\, \phi(u) \phi(v),$$
where $N_\rho$ is the two-dimensional Gaussian distribution with mean 0 and covariance matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$.
Then the dual kernel $K_\phi$ is defined over the unit sphere $S^{d-1}$: for every pair $x, \bar{x} \in S^{d-1}$, the dual kernel $K_\phi : S^{d-1} \times S^{d-1} \to \mathbb{R}$ is defined by $K_\phi(x, \bar{x}) := \hat{\phi}(x^\top \bar{x})$.
If a function $\phi \in H$, we can obtain not only an expansion of $\phi$ in the orthonormal basis of Hermite polynomials but also an expansion of the dual activation $\hat{\phi}$ using the same Hermite coefficients. As a consequence, the corresponding dual kernel $K_\phi$ can be shown to be strictly positive definite by using the Hermite expansion.
Lemma 11. (Daniely et al., 2016, Lemma 12) If $\phi \in H$, then the Hermite expansions are given by
$$\phi(x) = \sum_{n=0}^{\infty} a_n h_n(x), \tag{69}$$
$$\hat{\phi}(\rho) = \sum_{n=0}^{\infty} a_n^2 \rho^n, \tag{70}$$
where $a_n := \langle h_n, \phi \rangle$ are the Hermite coefficients.
Theorem 11. (Jacot et al., 2018, Theorem 3) (Gneiting, 2013, Theorem 1) For a function $f : [-1, 1] \to \mathbb{R}$ with $f(\rho) = \sum_{n=0}^{\infty} b_n \rho^n$, the kernel $K_f : S^{d-1} \times S^{d-1} \to \mathbb{R}$ defined by $K_f(x, \bar{x}) := f(x^\top \bar{x})$ is strictly positive definite for any $d \ge 1$ if and only if the coefficients $b_n > 0$ for infinitely many even and odd integers $n$.
Now, with these results, we are ready to prove the SPD of $\Sigma^L$.
Lemma 12. If $\phi$ is nonlinear and non-polynomial, then $\Sigma^1$ is SPD.
Proof. We first show $\Sigma^1$ is SPD. As $\Sigma^0(x, \bar{x}) = \frac{\sigma_u^2}{d} \langle x, \bar{x} \rangle$, we have $\Sigma^1(x, \bar{x}) = \sigma_w^2\, \mathbb{E}_{(u,v) \sim N(0, G^0)} [\phi(u) \phi(v)]$, where
$$G^0 = \frac{\sigma_u^2}{d} \begin{pmatrix} 1 & \langle x, \bar{x} \rangle \\ \langle x, \bar{x} \rangle & 1 \end{pmatrix}.$$
By the notion of dual activation, we have
$$\Sigma^1(x, \bar{x}) = \sigma_w^2\, \hat{\mu}(x^\top \bar{x}),$$
where $\mu(x) := \phi(\sigma_u x / \sqrt{d})$.
Clearly, $\mu$ is Lipschitz continuous since $\phi$ is. Then $\mu \in H$; let the expansion of $\mu$ in the Hermite polynomials $\{h_n\}_{n=0}^{\infty}$ be given by $\mu = \sum_{n=0}^{\infty} a_n h_n$, where $a_n = \langle \mu, h_n \rangle$ are the Hermite coefficients. Then we can write $\hat{\mu}$ as $\hat{\mu}(\rho) = \sum_{n=0}^{\infty} a_n^2 \rho^n$, and we have
$$\Sigma^1(x, \bar{x}) = \sigma_w^2\, \hat{\mu}(x^\top \bar{x}) = \sigma_w^2 \sum_{n=0}^{\infty} a_n^2 (x^\top \bar{x})^n.$$
Note that $\mu$ is non-polynomial if and only if $\phi$ is non-polynomial. As we assume $\phi$ is non-polynomial, $\mu$ is non-polynomial, hence there are infinitely many nonzero $a_n$ in the expansion. This indicates that $b_n := a_n^2 > 0$ for infinitely many even and odd numbers. As $\sigma_w^2 > 0$, $\Sigma^1$ is strictly positive definite.
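The identity $\hat{\phi}(\rho) = \sum_n a_n^2 \rho^n$ from Lemma 11, which the proof above relies on, can be verified numerically with Gauss-Hermite quadrature. The sketch below assumes $\phi = \tanh$ with unit input variance purely for illustration; the function names, the number of quadrature nodes, and the truncation at 30 terms are choices made here, not taken from the paper. It uses NumPy's probabilists' Hermite module, whose polynomials $He_n$ satisfy $h_n = He_n / \sqrt{n!}$.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

# Gauss quadrature for the weight exp(-x^2 / 2), rescaled to integrate against N(0, 1).
NODES, WEIGHTS = hermegauss(80)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)

def hermite_coefficient(phi, n):
    """a_n = E[phi(z) h_n(z)] for z ~ N(0, 1), with h_n = He_n / sqrt(n!)."""
    basis = np.zeros(n + 1)
    basis[n] = 1.0
    h_n = hermeval(NODES, basis) / math.sqrt(math.factorial(n))
    return float(np.sum(WEIGHTS * phi(NODES) * h_n))

def dual_activation(phi, rho):
    """E[phi(u) phi(v)] for unit-variance Gaussians with correlation rho, by 2D quadrature."""
    u = NODES[:, None]
    v = rho * NODES[:, None] + math.sqrt(1.0 - rho ** 2) * NODES[None, :]
    return float(np.sum(WEIGHTS[:, None] * WEIGHTS[None, :] * phi(u) * phi(v)))

def dual_from_series(phi, rho, num_terms=30):
    """Truncated Hermite series sum_n a_n^2 rho^n."""
    return sum(hermite_coefficient(phi, n) ** 2 * rho ** n for n in range(num_terms))

for rho in (0.5, -0.3):
    direct = dual_activation(np.tanh, rho)
    series = dual_from_series(np.tanh, rho)
    print(f"rho={rho:+.1f}  direct={direct:.8f}  series={series:.8f}")
```

For $|\rho| < 1$ the series converges geometrically, so a modest truncation already matches the direct two-dimensional integral closely.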
Next, we can show that if $\Sigma^L$ is SPD, then $\Sigma^{L+1}$ is also SPD for all $L \ge 1$.
Lemma 13. Suppose $\phi$ is nonlinear and non-polynomial. Given $L < \infty$, then
1. $\mathbb{E}[u^\ell \bar{u}^\ell] = C_{0,0}(x, \bar{x}) + \kappa^2 \sum_{i,j=1}^{\ell} C_{i,j}(x, \bar{x})$ is SPD for all $\ell \in \{1, 2, \dots, L + 1\}$,
2. $\Sigma^L$ is also SPD.
Proof. As we are working with the finite-depth network $f^L_\theta$, it is fine to assume $\kappa = 1$ to simplify the notation. Then $\Sigma^\ell$ and $\Sigma^L$ satisfy the recurrence relation stated in Proposition 3, and so do $C_{\ell,k}$ and $C_{L,K}$. By Lemma 12, $C_{1,1}$ is SPD. Additionally, we have
C 1,ℓ (x, x) = Eϕ(u 0 )ϕ(ū 1 ) = Eϕ(u 0 )ϕ(ū 0 ) = C 1,1 ,
where we use the fact
E[z 0 zℓ ] = δ 0,ℓ C 0,0 (x, x). Thus, C 1,ℓ is SPD for all ℓ. Recall that E[u ℓ ūk ] = C 0,0 (x, x) + ℓ i=1 k j=1 C i,j (x, x). Using this relation, we can write E[u ℓ ūℓ ] = C 0,0 (x, x) + C 1,1 (x, x) + 2 ℓ i=2 C 1,i (x, x) + ℓ i,j=2 C i,j (x, x).
As $C_{1,i}$ is SPD for all $i$, the symmetry of $C_{i,j}$ implies $\mathbb{E}[u^\ell \bar{u}^\ell]$ is SPD. Now, assume the contrary, i.e., $\Sigma^{\ell+1} = C_{\ell+1,\ell+1}$ is not SPD. Then there exist distinct $\{x_1, \dots, x_N\}$ and nonzero $a \in \mathbb{R}^N$ such that
$$0 = \sum_{i,j=1}^{N} a_i a_j C_{\ell+1,\ell+1}(x_i, x_j) = \sum_{i,j} a_i a_j\, \mathbb{E}[\phi(u^\ell_i) \phi(u^\ell_j)] = \mathbb{E} \left[ \left( \sum_{i=1}^{N} a_i \phi(u^\ell_i) \right)^2 \right].$$
We must have $\sum_i a_i \phi(u^\ell_i) = 0$ almost surely. As we have already shown that $u^\ell := (u^\ell_1, \dots, u^\ell_N) \in \mathbb{R}^N$ is a non-degenerate Gaussian random vector, the nonlinearity of $\phi$ implies $a = 0$, which contradicts $a \ne 0$. Hence, $\Sigma^{\ell+1} = C_{\ell+1,\ell+1}$ is SPD.

Section: F.2 STRICT POSITIVE DEFINITENESS OF NEURAL ODE'S NNGP KERNEL
Observe that the previous result uses induction to show $\Sigma^L$ is SPD. However, the strict positive definiteness of $\Sigma^L$ might diminish as $L \to \infty$. To address this, we conduct a fine-grained analysis of the properties of $\Sigma^L$ and demonstrate that these properties persist when $L \to \infty$. Consequently, $\Sigma^*$ retains these essential properties, which are crucial for proving that $\Sigma^*$ is SPD.
Recall from Theorem 1 that
Σ * (x, x) = E[ϕ(u * )ϕ(ū * )]
, where (u * , ū * ) are centered Gaussian random variables with covariance S * (x, x) defined as the limit of S L (x, x), i.e.,
S L (x, x) = C 0,0 (x, x) + κ 2 L ℓ,k=1 C ℓ,k (x, x) → S * (x, x), as L → ∞.(71)
Based on the proof of Theorem 1, S * is well defined. Some essential properties of S L and S * are given as follows.
Lemma 14. Suppose $L < \infty$. For any $x, \bar{x} \in S^{d-1}$, we have
1. $S^L(x, x) = S^L(\bar{x}, \bar{x})$,
2. $S^L(x, x) \ge S^L(x, \bar{x})$, and the equality holds if and only if $x = \bar{x}$.
Proof. As L < ∞, we can assume κ = 1 for simplicity. To prove the result, we make the inductive hypothesis that C ℓ,k (x, x) = C ℓ,k (x, x) for all ℓ, k ≤ L. Then observe that
S L+1 (x, x) = S L (x, x) + 2 L ℓ=1 C ℓ,L+1 (x, x) + C L+1,L+1 (x, x).
Using the inductive hypothesis, for any ℓ ∈ {1, 2,
• • • , L + 1} we have C ℓ,L+1 (x, x) = Eϕ(u ℓ-1 )ϕ(u L ) = Eϕ(ū ℓ-1 )ϕ(ū L ) = C ℓ,L+1 (x, x),
where (u ℓ-1 , u L ) are centered Gaussian random variables with covariance
E[u ℓ-1 u L ] = C 0,0 (x, x) + ℓ-1,L i,j=1 C i,j (x, x) = C 0,0 (x, x) + ℓ-1,L i,j=1 C i,j (x, x) = E[ū ℓ-1 ūL ].
This shows S L+1 (x, x) = S L+1 (x, x) and also C ℓ,k (x, x) = C ℓ,k (x, x) for all ℓ, k ≤ L + 1.
Next, using C ℓ,k (x, x) = C ℓ,k (x, x), we have
$$S^L(x, x) - S^L(x, \bar{x}) = \frac{1}{2} \|x - \bar{x}\|^2 + \frac{1}{2} \mathbb{E} \left| g^L(x) - g^L(\bar{x}) \right|^2,$$
where the function $g^L(x) := \kappa \sum_{\ell=1}^{L} \phi(u^\ell)$. This indicates $S^L(x, x) \ge S^L(x, \bar{x})$, and the equality holds if and only if $x = \bar{x}$.
Corollary 2. For any $x, \bar{x} \in S^{d-1}$, we have
1. $0 < S^*(x, x) = S^*(\bar{x}, \bar{x}) < \infty$,
2. $S^*(x, x) \ge S^*(x, \bar{x})$, and the equality holds if and only if $x = \bar{x}$.
Proof. Observe that $S^*(x, \bar{x}) = x^\top \bar{x} + \mathbb{E}[g(x) g(\bar{x})]$, where $g(x) := \lim_{L \to \infty} g^L(x)$ for $g^L(x) = L^{-1} \sum_{\ell=1}^{L} \phi(u^\ell)$. By Lemma 17, we obtain $g^L(x) = O(1)$ uniformly in $L$. Hence, $g(x)$ is well defined and $|g(x)| = O(1)$. Therefore, we obtain $S^*(x, x) = \Theta(1)$. Additionally, it follows from the relation for $S^L(x, x) - S^L(x, \bar{x})$ that
$$S^*(x, x) - S^*(x, \bar{x}) = \frac{1}{2} \|x - \bar{x}\|^2 + \frac{1}{2} \mathbb{E} |g(x) - g(\bar{x})|^2,$$
which allows us to obtain the second result.
Now, we are ready to prove the SPD of $\Sigma^*$.
Lemma 15. If $\phi$ is nonlinear and non-polynomial, then $\Sigma^*$ is SPD.
Proof. As ϕ is Lipschitz continuous, we can use Hermitian expansion to rewrite Σ * :
Σ * (x, x) = E (u,ū)∼S * (x,x) [ϕ(u)ϕ(ū)] = ∞ n=0 a 2 n [S * (x, x)/S 0 ] n ,
where we use (u, ū) ∼ S * (x, x) to denote centered Gaussian random variables with covariance computed using kernel S * (x, x), a n is the Hermitian coefficients of function ψ(u) := ϕ( √ S 0 u) with S 0 := S * (x, x), and we also use the facts S 0 = S * (x, x) = S * (x, x) for all x, x and S 0 = Θ(1) from Corollary 2.

Suppose we are given any finite set of distinct points $\{x_i\}_{i=1}^{N}$ from $S^{d-1}$ and nonzero $c \in \mathbb{R}^N$. Observe that
$$\sum_{i,j=1}^{N} c_i c_j \Sigma^*(x_i, x_j) = \sum_{n=0}^{\infty} a_n^2 S_0^{-n} \sum_{i,j=1}^{N} c_i c_j [S^*(x_i, x_j)]^n = \sum_{n=0}^{\infty} a_n^2 S_0^{-n} \sum_{i,j=1}^{N} c_i c_j \left( x_i^\top x_j + \mathbb{E}\, g(x_i) g(x_j) \right)^n,$$
where we use S * (x, x) = x T x + Eg(x)g(x). By using fundamental properties for positive definite matrices from linear algebra, we have
N i,j=1 c i c j x T i x j + Eg(x i )g(x j ) n =c T (XX T + Eg(X)g(X) T ) ⊙n c ≥c T (XX T ) ⊙n c = N i,j=1 c i c j x T i x j n ,
where ⊙ is Hadamard product. Then we obtain
N i,j=1 c i c j Σ * (x i , x j ) ≥ ∞ n=0 a 2 n S -n 0 N i,j=1 c i c j x T i x j n = N i,j=1 c i c j ∞ n=0 a 2 n (x T i x j /S 0 ) n = N i,j=1 c i c j E (u,ū)∼x T i xj /S0 [ψ(u)ψ(ū)] = N i,j=1 c i c j E (u,ū)∼x T i xj [ψ(u/ S 0 )ψ(ū/ S 0 )] = N i,j=1 c i c j E (u,ū)∼x T i xj [ϕ(u)ϕ(ū)] = N i,j=1 c i c j Σ 1 (x i , x j ),
where we use the definitions of Hermitian coefficients a n and ψ. By Lemma 12, Σ 1 is SPD and so Σ * is also SPD.
As a corollary, the NTK of the Neural ODE is also SPD.
Corollary 3. Suppose $\phi$ and $\phi'$ are nonlinear and Lipschitz continuous. If $\phi$ is non-polynomial, then the NTK $K_\infty$ of the Neural ODE is SPD.
Theorem 12. Let $\{x_i, y_i\}_{i=1}^{N}$ be a training set. Assume
1. $x_i \in S^{d-1}$, $|y_i| \le 1$, and $x_i \ne x_j$ for all $i \ne j$,
2. the activation $\phi$ is $L_1$-Lipschitz continuous, nonlinear, and non-polynomial,
3. its derivative $\phi'$ is $L_2$-Lipschitz continuous and nonlinear,
4. the learning rate satisfies $\eta \le 1 / \|X\|^2$.
For any $\delta > 0$, there exists a natural number $n_\delta$ such that for all $n \ge n_\delta$, with probability at least $1 - \delta$ over random initialization, the parameter $\theta_k$ stays in a neighborhood of $\theta_0$, i.e.,
$$\|\theta_k - \theta_0\| \le C \|X\| \|u_0 - y\| / \lambda_0, \tag{72}$$
and the loss function $L(\theta_k)$ decreases to zero at an exponential rate, i.e.,
$$L(\theta_k) \le \left( 1 - \frac{\eta \lambda_0}{16} \right)^k L(\theta_0), \tag{73}$$
where $C > 0$ is a constant that depends only on $L_1$, $L_2$, $\sigma_v$, $\sigma_w$, $\sigma_u$, and $T$.
Proof. Given distinct points $\{x_i\}_{i=1}^{N}$, we consider the limiting NTK matrix $H^\infty \in \mathbb{R}^{N \times N}$ defined as $H^\infty_{ij} = K_\infty(x_i, x_j)$.
As $\phi$ is non-polynomial, we have $\lambda_0 := \lambda_{\min}\{H^\infty\} > 0$. Let $\theta_0$ denote the parameters at initialization and $H(0) \in \mathbb{R}^{N \times N}$ be the corresponding NTK matrix computed from $\theta_0$. By Theorem 2, $H(0)$ converges a.s. to $H^\infty$ as the width $n \to \infty$. Then for any $\delta_0 > 0$, there exists a natural number $n_0$ such that, with probability at least $1 - \delta_0$ over random initialization, $\lambda_{\min}\{H(0)\} \ge \lambda_0 / 2$ for all $n \ge n_0$. By Lemma 19, there exists another natural number $n_1$ such that, with probability at least $1 - \delta_0$, the initial residual satisfies $\|u_0 - y\| \le \sigma^* \sqrt{2N \log(N/\delta)}$ for all $n \ge n_1$. Therefore, for any $\delta > 0$, we choose $\delta_0 = \delta / 2$, and it follows from Lemma 16 that, with probability at least $1 - \delta$ over random initialization, we have
$$\|v_k - v_0\|,\ \|W_k - W_0\|,\ \|U_k - U_0\| \le C \|X\| \|u_0 - y\| / \lambda_0,$$
and
$$\|u_k - y\| \le \left( 1 - \frac{\eta \lambda_0}{16} \right)^k \|u_0 - y\|,$$
for all $n \ge \max\left\{ n_0, n_1, C_0 N^3 \log(N/\delta) / \lambda_0^3 \right\}$.

Section: G GLOBAL CONVERGENCE OF NEURAL ODES
In this section, we provide the convergence analysis of Neural ODEs defined in equation 1 under gradient descent.
As we use the square loss, the loss function is given by
$$L(\theta) := \sum_{i=1}^{N} \frac{1}{2} \left( f_\theta(x_i) - y_i \right)^2. \tag{74}$$
By using the vectorization form equation 35 and chain rule, the gradients are given by
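$$\frac{\partial L(\theta)}{\partial v} = \sum_{i=1}^{N} \frac{\sigma_v}{\sqrt{n}} \phi(h_T(x_i)) \left( f_\theta(x_i) - y_i \right), \tag{75}$$
$$\frac{\partial L(\theta)}{\partial W} = \sum_{i=1}^{N} \left[ \int_0^T \frac{\sigma_w}{\sqrt{n}} \phi(h_t(x_i)) \otimes \lambda_t(x_i)\, dt \right] \left( f_\theta(x_i) - y_i \right), \tag{76}$$
$$\frac{\partial L(\theta)}{\partial U} = \sum_{i=1}^{N} \frac{\sigma_u}{\sqrt{d}} \left[ x_i \otimes \lambda_0(x_i) \right] \left( f_\theta(x_i) - y_i \right). \tag{77}$$
Consider the gradient descent
$$\theta_{k+1} = \theta_k - \eta \frac{\partial L(\theta_k)}{\partial \theta}. \tag{78}$$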
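The contraction that the following argument establishes can be seen exactly in a linearized setting. The sketch below assumes a model whose outputs are linear in the parameters, $u = J\theta$, with a hypothetical fixed Jacobian `J` (a simplification made here for illustration, not the Neural ODE itself). In that case gradient descent on the square loss gives $u_{k+1} - y = (I - \eta J J^\top)(u_k - y)$, so the residual shrinks at rate $1 - \eta \lambda_{\min}(J J^\top)$ whenever $\eta \le 1 / \lambda_{\max}(J J^\top)$.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_params = 5, 40  # overparameterized: more parameters than samples

# Hypothetical fixed Jacobian of the outputs with respect to the parameters.
J = rng.standard_normal((num_samples, num_params)) / np.sqrt(num_params)
y = rng.standard_normal(num_samples)
theta = np.zeros(num_params)

H = J @ J.T                      # Gram matrix, playing the role of the NTK
eigs = np.linalg.eigvalsh(H)     # eigenvalues in ascending order
lam_min, lam_max = eigs[0], eigs[-1]
eta = 1.0 / lam_max              # step size small enough for I - eta * H to be a contraction

residuals = []
for k in range(50):
    u = J @ theta
    residuals.append(np.linalg.norm(u - y))
    theta = theta - eta * J.T @ (u - y)  # gradient of 0.5 * ||J theta - y||^2

rate = 1.0 - eta * lam_min
for k in (0, 10, 25, 49):
    print(f"k={k:2d}  residual={residuals[k]:.3e}  bound={rate ** k * residuals[0]:.3e}")
```

The proof below handles the additional difficulty that, for the Neural ODE, the Jacobian changes along the trajectory; the width condition is what keeps that change small.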
Assume the inductive hypothesis: for all $i \le k$, there exist constants $\alpha_v, \alpha_w, \alpha_u > 0$ such that
1. $\|v_i\|, \|W_i\|, \|U_i\| \le C \sqrt{n}$,
2. $\|u_i - y\| \le (1 - \eta \alpha_0^2)^i \|u_0 - y\|$,
where $C > 0$ is a constant and $\alpha_0 := \sigma_{\min}\left( \frac{\sigma_v}{\sqrt{n}} \Phi_0 \right)$.
Without loss of generality, we assume $\sigma_v = 1$, $\sigma_w = \sigma$, $\sigma_u / \sqrt{d} = 1$, and $L_1 = L_2 = 1$.
Observe that
∥ ∂f θ ∂v ∥ = ∥ 1 √ n ϕ(h T )∥ ≤ 1 √ n ∥U ∥∥x∥e σT ∥W ∥/ √ n .
Note that
∥ ∂f θ ∂W ∥ ≤ σ √ n T 0 ∥ϕ(h t )∥∥λ t ∥dt ≤ σ √ n T 0 ∥U ∥∥x∥e σt∥W ∥/ √ n • ∥v∥ √ n e σ(T -t)∥W ∥/ √ n dt =(σT ) ∥U ∥ √ n ∥v∥ √ n ∥x∥e σT ∥W ∥/ √ n .
Observe that
∥ ∂f θ ∂U ∥ ≤ ∥x∥∥λ 0 ∥ ≤ ∥x∥ • ∥v∥ √ n exp σT ∥W ∥/ √ n
By using the inductive hypothesis, we obtain
∥ ∂f θ ∂v ∥ ≤ Ce CσT ∥x∥,(79)
∥ ∂f θ ∂W ∥ ≤ (σT )Ce CσT ∥x∥,(80)
∥ ∂f θ ∂U ∥ ≤ Ce CσT ∥x∥.(81)
Then we obtain
∥v k+1 -v 0 ∥ ≤η k i=0 ∥ ∂L(θ i ) ∂v ∥ ≤η k i=0 Ce CσT ∥X∥∥u i -y∥ ≤ηCe CσT ∥X∥ k i=0 (1 -ηα 2 0 ) i ∥u 0 -y∥ ≤Ce CσT ∥X∥∥u 0 -y∥/α 2 0
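Note that the RHS is a constant after initialization. If we assume $\|x\| = 1$ and $|y| = 1$, then we need to ensure
$$C e^{C\sigma T} \|X\| \|u_0 - y\| / \alpha_0^2 \le C \sqrt{n}. \tag{82}$$
As a result, we have
$$\|v_{k+1}\| \le \|v_{k+1} - v_0\| + \|v_0\| \le C \sqrt{n}.$$
Similarly, we have
$$
\begin{aligned}
\|W_{k+1} - W_0\| &\le \eta \sum_{i=0}^{k} \left\| \frac{\partial L(\theta_i)}{\partial W} \right\| \le \eta \sum_{i=0}^{k} (\sigma T) C e^{C\sigma T} \|X\| \|u_i - y\| \\
&\le \eta (\sigma T) C e^{C\sigma T} \|X\| \sum_{i=0}^{k} (1 - \eta \alpha_0^2)^i \|u_0 - y\| \le (\sigma T) C e^{C\sigma T} \|X\| \|u_0 - y\| / \alpha_0^2.
\end{aligned}
$$
Then we need to ensure
$$(\sigma T) C e^{C\sigma T} \|X\| \|u_0 - y\| / \alpha_0^2 \le C \sqrt{n}. \tag{83}$$
Then we obtain
$$\|W_{k+1}\| \le \|W_{k+1} - W_0\| + \|W_0\| \le C \sqrt{n}.$$
Observe that
$$
\begin{aligned}
\|U_{k+1} - U_0\| &\le \eta \sum_{i=0}^{k} \left\| \frac{\partial L(\theta_i)}{\partial U} \right\| \le \eta \sum_{i=0}^{k} C e^{C\sigma T} \|X\| \|u_i - y\| \\
&\le \eta C e^{C\sigma T} \|X\| \sum_{i=0}^{k} (1 - \eta \alpha_0^2)^i \|u_0 - y\| \le C e^{C\sigma T} \|X\| \|u_0 - y\| / \alpha_0^2.
\end{aligned}
$$
Hence, we obtain
$$\|U_{k+1}\| \le \|U_{k+1} - U_0\| + \|U_0\| \le C \sqrt{n}.$$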
Next, observe that
u k+1 -y =u k+1 -u k + (u k -y) = ∂ ũ ∂θ ⊤ (θ k+1 -θ k ) + (u k -y) = ∂ ũ ∂θ ⊤ -η ∂u k ∂θ (u k -y) + (u k -y) = I -η ∂ ũ ∂θ ⊤ ∂u k ∂θ (u k -y) = I -η ∂u k ∂θ ⊤ ∂u k ∂θ (u k -y) + η ∂u k ∂θ - ∂ ũ ∂θ ⊤ ∂u k ∂θ (u k -y)
where ũ = u( θ) and θ is an interpolation in between θ k and θ k+1 .
Note that
∥ ∂f ∂v - ∂ f ∂v ∥ =∥ 1 √ n ϕ(h T ) - 1 √ n ϕ( hT )∥ ≤ 1 √ n ∥h T -hT ∥ ≤ C √ n ∥θ -θ∥e CσT ∥x∥
where we use Lemma 18 and the inductive hypotheses.
Similarly, note that
∥ ∂f ∂W - ∂ f ∂W ∥ ≤ σ √ n ∥ T 0 ϕ(h t ) ⊗ λ t -ϕ( ht ) ⊗ λt dt∥ ≤ σ √ n T 0 ∥h t -ht ∥∥λ t ∥ + ∥ ht ∥∥λ t -λt ∥ dt ≤C σ √ n T 0 ∥θ -θ∥e Cσt ∥x∥ • e Cσ(T -t) dt ≤(σT ) C √ n ∥θ -θ∥e CσT ∥x∥.
and
∥ ∂f ∂U - ∂ f ∂U ∥ ≤ ∥x∥∥λ 0 -λ0 ∥ ≤ C √ n ∥θ -θ∥e CσT ∥x∥.
Hence, we have
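$$\left\| \frac{\partial f}{\partial \theta} - \frac{\partial \bar{f}}{\partial \theta} \right\| \le \left\| \frac{\partial f}{\partial v} - \frac{\partial \bar{f}}{\partial v} \right\| + \left\| \frac{\partial f}{\partial W} - \frac{\partial \bar{f}}{\partial W} \right\| + \left\| \frac{\partial f}{\partial U} - \frac{\partial \bar{f}}{\partial U} \right\| \le (\sigma T) \frac{C}{\sqrt{n}} \|\theta - \bar{\theta}\| e^{C\sigma T} \|x\|.$$
Then
$$\left\| \frac{\partial u_k}{\partial \theta} - \frac{\partial \tilde{u}}{\partial \theta} \right\| \le (\sigma T) \frac{C}{\sqrt{n}} \|\theta_k - \tilde{\theta}\| e^{C\sigma T} \|X\| \le (\sigma T) \frac{C}{\sqrt{n}} \|\theta_k - \theta_{k+1}\| e^{C\sigma T} \|X\|,$$
where we use the fact $\tilde{\theta} = \alpha \theta_k + (1 - \alpha) \theta_{k+1}$ for some $\alpha \in [0, 1]$.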
Observe that
∥θ k+1 -θ k ∥ = η∥ ∂L(θ k ) ∂θ ∥ = η∥ ∂u k ∂θ ⊤ (u k -y)∥ ≤ η(σT )Ce CσT ∥X∥∥u k -y∥.
Hence, we obtain
∥ ∂u k ∂θ - ∂ ũ ∂θ ∥ ≤ η(σT ) 2 C √ n e CσT ∥X∥ 2 ∥u k -y∥,and
∥ ∂u k ∂θ - ∂u 0 ∂θ ∥ ≤(σT ) C √ n ∥θ k -θ 0 ∥e CσT ∥X∥ ≤(σT ) C √ n e CσT ∥X∥ k-1 i=0 ∥θ i+1 -θ i ∥ ≤η(σT ) 2 C √ n e CσT ∥X∥ 2 k-1 i=0 ∥u i -y∥ ≤η(σT ) 2 C √ n e CσT ∥X∥ 2 k-1 i=0 (1 -ηα 2 0 ) i ∥u 0 -y∥ ≤(σT ) 2 C √ n e CσT ∥X∥ 2 ∥u 0 -y∥/α 2 0 ≤α 0 /2,
where we use the assumption
√ n ≥ C(σT ) 2 e CσT ∥X∥ 2 ∥u 0 -y∥/α 3 0 . (84
)
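It follows from Weyl's inequality that
$$\sigma_{\min}\left( \frac{\partial u_k}{\partial \theta} \right) \ge \sigma_{\min}\left( \frac{\partial u_0}{\partial \theta} \right) - \left\| \frac{\partial u_k}{\partial \theta} - \frac{\partial u_0}{\partial \theta} \right\| \ge \alpha_0 / 2,$$
and so
$$\lambda_{\min}\left( \frac{\partial u_k}{\partial \theta}^\top \frac{\partial u_k}{\partial \theta} \right) \ge \alpha_0^2 / 4.$$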
Therefore, we obtain
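$$
\begin{aligned}
\|u_{k+1} - y\| &\le \left( 1 - \eta \alpha_0^2 / 4 \right) \|u_k - y\| + \eta^2 (\sigma T)^3 \frac{C}{\sqrt{n}} e^{C\sigma T} \|X\|^3 \|u_k - y\|^2 \\
&\le \left( 1 - \eta \alpha_0^2 / 4 + \eta^2 (\sigma T)^3 \frac{C}{\sqrt{n}} e^{C\sigma T} \|X\|^3 \|u_0 - y\| \right) \|u_k - y\| \\
&= \left( 1 - \eta \left[ \alpha_0^2 / 4 - \eta (\sigma T)^3 \frac{C}{\sqrt{n}} e^{C\sigma T} \|X\|^3 \|u_0 - y\| \right] \right) \|u_k - y\| \\
&\le \left( 1 - \eta \alpha_0^2 / 8 \right) \|u_k - y\|,
\end{aligned}
$$
where we assume $\sqrt{n} \ge 8 C \eta (\sigma T)^3 e^{C\sigma T} \|X\|^3 \|u_0 - y\| / \alpha_0^2$. This finishes proving Lemma 16.
Lemma 16. Assume $\phi$ and $\phi'$ are $L_1$- and $L_2$-Lipschitz continuous, respectively, and $\lambda_0 := \lambda_{\min}(K_{\theta_0}) > 0$. Suppose we choose the width $n = \Omega(\|X\|^4 \|u_0 - y\|^2 / \lambda_0^3)$ and the learning rate $\eta \le 1 / \|X\|^2$. Then the parameters $\theta_k$ stay in a neighborhood of $\theta_0$, i.e.,
$$\|v_k - v_0\|,\ \|W_k - W_0\|,\ \|U_k - U_0\| \le C \|X\| \|u_0 - y\| / \lambda_0, \tag{85}$$
and the residual $\|u_k - y\|$ consistently decreases, i.e.,
$$\|u_k - y\| \le \left( 1 - \frac{\eta \lambda_0}{8} \right)^k \|u_0 - y\|, \tag{86}$$
where $C > 0$ is a constant that depends only on $L_1$, $L_2$, $\sigma_v$, $\sigma_w$, $\sigma_u$, and $T$.
Lemma 17. Given $\theta$, we have
$$\|h_t\| \le \|U\| \|x\| \exp\left( \frac{\sigma t}{\sqrt{n}} \|W\| \right), \tag{87}$$
$$\|\lambda_t\| \le \frac{\|v\|}{\sqrt{n}} \exp\left( \frac{\sigma (T - t)}{\sqrt{n}} \|W\| \right), \tag{88}$$
for all $t \in [0, T]$.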
Proof. Observe that
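$$h_t = h_0 + \int_0^t \frac{\sigma}{\sqrt{n}} W \phi(h_s)\, ds,$$
and so
$$\|h_t\| \le \|h_0\| + \frac{\sigma}{\sqrt{n}} \|W\| \int_0^t \|h_s\|\, ds.$$
Then it follows from Gronwall's inequality that
$$\|h_t\| \le \|U\| \|x\| \exp\left( \frac{\sigma t}{\sqrt{n}} \|W\| \right), \quad \forall t \in [0, T]. \tag{89}$$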
Similarly, we have
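$$\lambda_t = \lambda_T + \int_t^T -\frac{\sigma}{\sqrt{n}} \operatorname{diag}[\phi'(h_s)]\, W^\top \lambda_s\, ds \quad \text{implies} \quad \|\lambda_t\| \le \|\lambda_T\| + \frac{\sigma}{\sqrt{n}} L_1 \|W\| \int_t^T \|\lambda_s\|\, ds.$$
By Gronwall's inequality, we obtain
$$\|\lambda_t\| \le \|\lambda_T\| \exp\left( \int_t^T \sigma \|W\| / \sqrt{n}\, ds \right) \le \|\lambda_T\| \exp\left( \sigma \|W\| (T - t) / \sqrt{n} \right).$$
By $\lambda_T = \frac{1}{\sqrt{n}} \operatorname{diag}[\phi'(h_T)]\, v$, we obtain the final result.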
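The forward bound of Lemma 17 can be checked numerically along an Euler discretization of the hidden-state ODE. The sketch below is an illustration under stated assumptions: $\phi = \tanh$ (which is 1-Lipschitz with $\phi(0) = 0$), $\sigma = 1$, $T = 2$, and a random Gaussian draw of $U$ and $W$; none of these specific values come from the paper. For such an activation each Euler step satisfies $\|h^{\ell+1}\| \le (1 + \Delta t\, \sigma \|W\| / \sqrt{n}) \|h^\ell\|$, so the discrete trajectory obeys the same exponential envelope.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, T, steps = 64, 8, 1.0, 2.0, 2000

U = rng.standard_normal((n, d)) / np.sqrt(d)  # absorbs the sigma_u / sqrt(d) scaling
W = rng.standard_normal((n, n))
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

U_norm = np.linalg.norm(U, 2)  # spectral norm
W_norm = np.linalg.norm(W, 2)
dt = T / steps

h = U @ x
worst_ratio = 0.0
for step in range(steps + 1):
    t = step * dt
    bound = U_norm * np.linalg.norm(x) * np.exp(sigma * t * W_norm / np.sqrt(n))
    worst_ratio = max(worst_ratio, np.linalg.norm(h) / bound)
    # Explicit Euler step of dh/dt = (sigma / sqrt(n)) * W * tanh(h).
    h = h + dt * sigma / np.sqrt(n) * (W @ np.tanh(h))

print("max over t of ||h_t|| / bound:", worst_ratio)
```

A ratio well below one indicates that the Gronwall bound is conservative for this draw, which is expected since it uses worst-case operator norms.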
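Lemma 18. Given $\theta, \bar{\theta}$, we have
$$\|h_t - \bar{h}_t\| \le \|\theta - \bar{\theta}\| \frac{\|U\|}{\|W\|} e^{\sigma t (\|W\| + \|\bar{W}\|) / \sqrt{n}} \|x\|, \tag{90}$$
$$\|\lambda_t - \bar{\lambda}_t\| \le \|\theta - \bar{\theta}\| \frac{\|v\|}{\|W\|} e^{\sigma (T - t)(\|W\| + \|\bar{W}\|) / \sqrt{n}} / \sqrt{n}, \tag{91}$$
for all $t \in [0, T]$.
Proof. Observe that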
h t -ht = (h 0 -h0 ) + σ √ n t 0 W ϕ(h s ) -W ϕ( hs ) ds
Then we have
∥h t -ht ∥ ≤∥h 0 -h0 ∥ + σ √ n t 0 ∥W -W ∥∥h s ∥ + ∥ W ∥∥h s -hs ∥ ds ≤∥h 0 -h0 ∥ + σ √ n ∥W -W ∥ t 0 ∥U x∥ exp σs∥W ∥/ √ n ds + σ √ n ∥ W ∥ t 0 ∥h s -hs ∥ds
Using the bound of ∥h s ∥, we have
σ √ n ∥U x∥∥W -W ∥ t 0 exp σs∥W ∥/ √ n ds = σ √ n ∥U x∥∥W -W ∥ • σ √ n ∥W ∥ -1 e σt∥W ∥/ √ n -1 = ∥U ∥ ∥W ∥ ∥W -W ∥ e σt∥W ∥/ √ n -1 ∥x∥.
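Then by Gronwall's inequality, we obtain
$$\|h_t - \bar{h}_t\| \le \left[ \|h_0 - \bar{h}_0\| + \|W - \bar{W}\| \frac{\|U\|}{\|W\|} \left( e^{\sigma \|W\| t / \sqrt{n}} - 1 \right) \|x\| \right] e^{\sigma \|\bar{W}\| t / \sqrt{n}} \le \left[ \|U - \bar{U}\| + \|W - \bar{W}\| \frac{\|U\|}{\|W\|} \right] e^{\sigma t (\|W\| + \|\bar{W}\|) / \sqrt{n}} \|x\|.$$
This yields the first result.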
Observe that
λ t -λt = (λ T -λT ) + σ √ n T t diag[ϕ ′ (h s )]W ⊤ λ s -diag[ϕ ′ ( hs )] W ⊤ λs ds.
Then we obtain
∥λ t -λt ∥ ≤ ∥λ T -λT ∥ + σ √ n T t ∥W -W ∥∥λ s ∥ + ∥ W ∥∥λ s -λs ∥ ds
By using the bound of ∥λ s ∥, we obtain
σ √ n ∥W -W ∥ ∥v∥ √ n T t exp σ(T -s) √ n ∥W ∥ ds ≤ σ √ n ∥W -W ∥ ∥v∥ √ n σ √ n ∥W ∥ -1 e σ(T -t)∥W ∥/ √ n -1 = 1 √ n ∥W -W ∥ ∥v∥ ∥W ∥ e σ(T -t)∥W ∥/ √ n -1 .
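Then by Gronwall's inequality, we have
$$\|\lambda_t - \bar{\lambda}_t\| \le \left[ \|\lambda_T - \bar{\lambda}_T\| + \frac{1}{\sqrt{n}} \|W - \bar{W}\| \frac{\|v\|}{\|W\|} \left( e^{\sigma (T - t) \|W\| / \sqrt{n}} - 1 \right) \right] e^{\sigma (T - t) \|\bar{W}\| / \sqrt{n}} \le \frac{1}{\sqrt{n}} \left[ \|v - \bar{v}\| + \|W - \bar{W}\| \frac{\|v\|}{\|W\|} \right] e^{\sigma (T - t)(\|W\| + \|\bar{W}\|) / \sqrt{n}}.$$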
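Lemma 19. Given $\delta > 0$, there exists a natural number $n_\delta$ such that for all $n \ge n_\delta$, with probability at least $1 - \delta$ over random initialization, we have
$$\|u\| \le \sigma \sqrt{2N \log(N/\delta)}, \tag{92}$$
where $\sigma^2 := \Sigma^*(x, x)$ for $x \in S^{d-1}$.
Proof. Fix $x$ and denote $u := f_\theta(x) = v^\top \phi(h_T(x)) / \sqrt{n}$. By Theorem 1, $u$ converges in distribution to a centered Gaussian random variable with variance $\sigma^2 := \Sigma^*(x, x)$. Hence, given $\delta > 0$, there exists $n_\delta$ such that $n \ge n_\delta$ implies
$$|P(u \ge \varepsilon) - P(z \ge \varepsilon)| \le \delta / 2,$$
where $z \sim N(0, \sigma^2)$. Then we have
$$P(u \ge \varepsilon) \le \delta / 2 + P(z \ge \varepsilon) \le \delta / 2 + e^{-\varepsilon^2 / 2\sigma^2} \le \delta,$$
where the last inequality is due to $\varepsilon := \sigma \sqrt{2 \log(2/\delta)}$. Similarly, we obtain the two-tailed bound, i.e.,
$$P(|u| \ge \varepsilon) \le \delta.$$
Now, denote $u = f_\theta(X) \in \mathbb{R}^N$ as a vector. We have
$$P(\|u\| \ge \varepsilon_0) = P(\|u\|^2 \ge \varepsilon_0^2) = P\left( \sum_{i=1}^{N} |u_i|^2 \ge \varepsilon_0^2 \right) \le \sum_{i=1}^{N} P\left( |u_i|^2 \ge \varepsilon_0^2 / N \right) = \sum_{i=1}^{N} P\left( |u_i| \ge \varepsilon_0 / \sqrt{N} \right) \le \delta,$$
where we use the fact $P\left( \sum_{i=1}^{N} x_i \ge \varepsilon \right) \le \sum_{i=1}^{N} P(x_i \ge \varepsilon / N)$ and $\varepsilon_0 := \sigma \sqrt{2N \log(N/\delta)}$.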
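The Gaussian part of this argument can be checked directly with the complementary error function. The sketch below assumes each coordinate $u_i$ is exactly $N(0, \sigma^2)$, that is, it ignores the finite-width approximation error that the lemma handles through $n_\delta$; the function names and the sample values of $N$ and $\delta$ are chosen here for illustration. It evaluates the union bound $N \cdot P(|z| \ge \varepsilon_0 / \sqrt{N})$ at the threshold $\varepsilon_0 = \sigma \sqrt{2N \log(N/\delta)}$.

```python
import math

def two_sided_gaussian_tail(eps, sigma):
    """P(|z| >= eps) for z ~ N(0, sigma^2), via the complementary error function."""
    return math.erfc(eps / (sigma * math.sqrt(2.0)))

def residual_threshold(sigma, num_samples, delta):
    """epsilon_0 = sigma * sqrt(2 N log(N / delta)), the threshold in Lemma 19."""
    return sigma * math.sqrt(2.0 * num_samples * math.log(num_samples / delta))

def union_bound(sigma, num_samples, delta):
    """Union bound on P(||u|| >= epsilon_0) when each u_i is exactly N(0, sigma^2)."""
    eps_0 = residual_threshold(sigma, num_samples, delta)
    per_coordinate = two_sided_gaussian_tail(eps_0 / math.sqrt(num_samples), sigma)
    return num_samples * per_coordinate

for N, delta in [(10, 0.1), (100, 0.05), (1000, 0.01)]:
    print(f"N={N:5d}  delta={delta:.2f}  union bound={union_bound(1.0, N, delta):.2e}")
```

In the exactly Gaussian case the union bound falls below $\delta$, since $P(|z| \ge \varepsilon) \le e^{-\varepsilon^2 / 2\sigma^2}$.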

Section: H ADDITIONAL EXPERIMENTS
In this appendix, we provide supplementary experiments that complement the results in the main paper. These experiments explore the impact of different activation functions, scaling for long time horizons, and the behavior of Neural ODEs when approximated by Gaussian processes. Additionally, we examine the behavior of the NTK when using polynomial activations.

Section: H.1 SCALING FOR LONG-TIME HORIZONS
As discussed in Proposition 1 and Proposition 2, smooth activations ensure that the forward and backward dynamics of Neural ODEs have globally unique solutions. However, extending the time range or working with long-time horizons in the dynamics can introduce difficulties for numerical solvers, leading to higher numerical errors. To understand how Neural ODEs behave over extended time horizons, we investigated their behavior at initialization as the time horizon increases, focusing on how output magnitudes and variance are affected. The objective was to understand how extending the time horizon impacts the model's outputs and the subsequent training process.
At initialization, as the time horizon T increases, the output magnitudes grow larger, resulting in increased variance, as shown in Figure 3 

Section: H.2 GAUSSIAN PROCESS APPROXIMATION
In Section 4, we established that Neural ODEs tend toward a Gaussian Process (GP) as their width increases, as demonstrated in Theorem 1. The associated NNGP kernel of this Gaussian process is non-degenerate, as stated in Lemma 5. To empirically verify these theoretical findings, we conducted a series of experiments.
First, we fixed an input x and initialized 10,000 random Neural ODEs. We then plotted the output histograms for various network widths and fitted the distributions with a Gaussian model, as shown in Figure 4. Additionally, we ran statistical tests to confirm whether the output distributions followed a Gaussian distribution. The Kolmogorov-Smirnov (KS) test statistics and p-values indicated that as long as the width exceeds 100, the outputs closely follow a Gaussian distribution.
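A Gaussianity check of this kind can be sketched as follows. The code draws outputs of randomly initialized Neural ODEs at a fixed input and computes the Kolmogorov-Smirnov distance to a Gaussian with matched mean and standard deviation. Everything here is an assumption of the sketch: the Euler integrator, the small number of trials, the widths, and the hand-written KS statistic. Because the Gaussian parameters are estimated from the sample, only the statistic is reported and no p-value is computed.

```python
import math
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def ks_statistic_normal(samples):
    """KS distance between the empirical CDF of `samples` and a Gaussian
    with the same mean and standard deviation."""
    z = np.sort((samples - samples.mean()) / samples.std())
    cdf = np.array([0.5 * (1.0 + math.erf(t / math.sqrt(2.0))) for t in z])
    m = len(z)
    upper = np.arange(1, m + 1) / m - cdf   # ECDF just after each sample point
    lower = cdf - np.arange(0, m) / m       # ECDF just before each sample point
    return float(max(upper.max(), lower.max()))

def random_node_outputs(width, trials=500, d=8, T=1.0, steps=20, seed=0):
    """Scalar outputs of `trials` random Neural ODEs (forward Euler, all sigmas = 1)."""
    rng = np.random.default_rng(seed)
    x = np.ones(d) / np.sqrt(d)
    outs = np.empty(trials)
    for i in range(trials):
        U = rng.standard_normal((width, d))
        W = rng.standard_normal((width, width))
        v = rng.standard_normal(width)
        h = U @ x / np.sqrt(d)
        for _ in range(steps):
            h = h + (T / steps) * (W @ softplus(h)) / np.sqrt(width)
        outs[i] = v @ softplus(h) / np.sqrt(width)
    return outs

for width in (5, 50, 200):
    stat = ks_statistic_normal(random_node_outputs(width))
    print(f"width={width:4d}  KS statistic={stat:.4f}")
```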
Next, we analyzed the independence of the output neurons by plotting pairwise outputs across two coordinates. According to Theorem 1, the output neurons should become independent as the width increases. Figure 5 confirms this: while the diagonal plots show Gaussian bell shapes, the off-diagonal plots resemble random ball shapes, indicating that the neurons are uncorrelated and, being jointly Gaussian in the limit, independent as the width increases.
Finally, we investigated whether Neural ODEs preserve the structure of input data at the output. We constructed a matrix X of 10 samples and calculated the input covariance matrix XX ⊤ . Then, we initialized 10,000 Neural ODEs with random weights and evaluated them on the input X, computing the output covariance matrix. As shown in Figure 6, the output covariance matrix retained the correlation patterns of the input matrix but with reduced magnitudes, indicating that Neural ODEs act as structure-preserving smoothers, reducing the spread of the data while maintaining its underlying relationships.    

Section: H.4 POLYNOMIAL ACTIVATIONS FOR NTK AND GLOBAL CONVERGENCE
In this experiment, we tested quadratic activation functions to assess their impact on NTK behavior and convergence. While previous results showed that an activation being nonlinear and non-polynomial is a sufficient condition for the strict positive definiteness (SPD) of the NTK, our experiments suggest that it is not a necessary condition.
We observed that the NTK of Neural ODEs using quadratic activation is also strictly positive definite, with the smallest eigenvalue slightly higher than that of Softplus, as shown in Figure 8(d). Despite this, the quadratic Neural ODE converged much more slowly than Softplus, as illustrated in the training and test losses (Figure 8(a)-(b)). In terms of parameter behavior (Figure 8(c)), the parameter differences for the quadratic activation were slightly larger than those for Softplus, meaning the parameters drifted further from their initial values. However, these differences remained within the same order of magnitude, indicating that the model still satisfies the conditions for global convergence, even though it does not meet the sufficient condition of being non-polynomial.
In summary, while quadratic activation functions result in strictly positive definite NTKs similar to non-polynomial activations, they lead to slower convergence and slightly less stable parameter behavior compared to smoother activations like Softplus. This suggests that while non-polynomiality is not strictly necessary for SPD and convergence, smoother activations may offer practical benefits for faster and more stable training.
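The smallest NTK eigenvalue can be estimated at finite width along the following lines. This is a minimal sketch, not the experimental code: it uses a depth-L Euler discretization, sets all scale parameters to 1, computes gradients by hand-written backpropagation, and uses a small hypothetical data set of random unit vectors. The helper name `empirical_ntk` and all sizes are assumptions.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # derivative of softplus

def empirical_ntk(X, U, W, v, phi, dphi, T=1.0, L=50):
    """Gram matrix K_ij = <grad_theta f(x_i), grad_theta f(x_j)> for the
    depth-L Euler discretisation with sigma_u = sigma_w = sigma_v = 1."""
    n, d = U.shape
    kappa = T / L
    grads = []
    for x in X:
        hs = [U @ x / np.sqrt(d)]
        for _ in range(L):                                   # forward pass
            hs.append(hs[-1] + kappa * (W @ phi(hs[-1])) / np.sqrt(n))
        g_v = phi(hs[-1]) / np.sqrt(n)                       # gradient w.r.t. v
        lam = dphi(hs[-1]) * v / np.sqrt(n)                  # adjoint at the last layer
        g_W = np.zeros_like(W)
        for l in range(L, 0, -1):                            # backward pass
            g_W += kappa * np.outer(lam, phi(hs[l - 1])) / np.sqrt(n)
            lam = lam + kappa * dphi(hs[l - 1]) * (W.T @ lam) / np.sqrt(n)
        g_U = np.outer(lam, x) / np.sqrt(d)                  # gradient w.r.t. U
        grads.append(np.concatenate([g_v, g_W.ravel(), g_U.ravel()]))
    G = np.stack(grads)
    return G @ G.T

rng = np.random.default_rng(0)
n, d, N = 64, 8, 6
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)                # inputs on the unit sphere
U = rng.standard_normal((n, d))
W = rng.standard_normal((n, n))
v = rng.standard_normal(n)

activations = {
    "softplus": (softplus, sigmoid),
    "quadratic": (np.square, lambda z: 2.0 * z),
}
min_eigs = {}
for name, (phi, dphi) in activations.items():
    K = empirical_ntk(X, U, W, v, phi, dphi)
    min_eigs[name] = float(np.linalg.eigvalsh(K).min())
    print(f"{name:10s} smallest NTK eigenvalue = {min_eigs[name]:.4e}")
```

A strictly positive smallest eigenvalue at a single random initialization is only an indication; the statement in the text concerns the limiting kernel.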

Section: H.5 CONVERGENCE ANALYSIS ON DIVERSE DATASETS
In the main paper, we focused on the convergence properties of Neural ODEs using different activation functions on the MNIST dataset. To ensure that these findings generalize across different types of data and tasks, we extended our experiments to three additional datasets: CIFAR-10 (image classification), AG News (text classification), and Daily Climate (time series forecasting). This section details the performance of three key activation functions-Softplus, ReLU, and GELU-on these datasets, highlighting their effects on convergence speed, stability, and generalization.
For each dataset, we trained Neural ODE models with different widths (500, 1000, 2000, and 3000) using Softplus, ReLU, and GELU activations. We monitored the training loss and test loss, comparing how different activations influence convergence behavior across datasets. The optimizer used was gradient descent with a learning rate of 0.1, and models were trained for 100 epochs.
For CIFAR-10, the results showed minimal differences between the activation functions.
• Softplus, ReLU, and GELU all exhibited similar convergence patterns, with larger widths leading to faster convergence across the board.
• Larger widths consistently resulted in lower training and test losses, but the specific choice of activation did not have a significant impact on the overall performance or convergence speed.
These results suggest that for CIFAR-10, the activation function choice is less critical, particularly when the network is sufficiently wide (see Figure 9).
For AG News, we observed distinct convergence patterns across the activation functions:
• Softplus converged the fastest, followed by ReLU, with GELU converging the slowest. Despite GELU being a smooth activation, its derivative differs significantly compared to the other activations, which may explain the slower convergence rate.
• All three activations shared the same trend: larger widths led to faster convergence and lower test losses. However, the differences between activation functions were more pronounced at smaller widths, where GELU lagged behind (Figure 10).
This suggests that while GELU's smoothness offers theoretical benefits, in practice, its derivative may cause slower optimization dynamics, particularly for text-based tasks like AG News.
In this experiment, we investigate the impact of non-smooth activation functions, specifically ReLU, on the performance of Neural ODEs and their ResNet approximations under the "Discretize-Then-Optimize" and "Optimize-Then-Discretize" frameworks. While the output differences between the two frameworks decrease as the depth L increases, our results reveal that the gradient differences fail to converge because ReLU's derivative is discontinuous.
Smooth Activation Functions (Softplus). For smooth activation functions like Softplus, both the output difference and gradient difference between the two frameworks decrease at a rate of 1/L as the depth L increases. This behavior aligns with Proposition 2 and is illustrated in Figure 12(a)-(b).
Non-Smooth Activation Functions (ReLU). In contrast, for ReLU, the output difference still decreases at a rate of 1/L, as shown in Figure 12(c). However, the gradient difference fails to converge, as illustrated in Figure 12(d). Initially, the gradient difference reduces as depth increases, but it eventually stagnates at a fixed error level. Increasing the network width does not resolve this issue. Notably, the largest gradient difference is observed at width 500, whereas smaller errors are achieved for both smaller and larger widths, such as width 200 and 1000. These results confirm that the lack of a continuous derivative in ReLU introduces inconsistencies in gradient computations between the two frameworks.
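The depth dependence described above can be probed with a short numerical sketch. It builds the depth-L Euler (ResNet) discretization, computes the gradient with respect to W by backpropagation, and measures the distance to a very fine discretization that serves here as a stand-in for the continuous model. The fine grid as reference, the sizes, and the function names are assumptions of this sketch; the activation pair `phi`, `dphi` can be swapped to compare Softplus with ReLU.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def softplus_grad(z):
    # The derivative of softplus is the logistic sigmoid.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def resnet_output_and_grad_W(x, U, W, v, T, L, phi=softplus, dphi=softplus_grad):
    """Depth-L Euler discretisation with sigma_u = sigma_w = sigma_v = 1.
    Returns the scalar output and its gradient w.r.t. W via backpropagation."""
    n, d = U.shape
    kappa = T / L
    hs = [U @ x / np.sqrt(d)]
    for _ in range(L):                        # forward pass through L residual layers
        hs.append(hs[-1] + kappa * (W @ phi(hs[-1])) / np.sqrt(n))
    out = v @ phi(hs[-1]) / np.sqrt(n)
    lam = dphi(hs[-1]) * v / np.sqrt(n)       # gradient of the output w.r.t. the last state
    grad_W = np.zeros_like(W)
    for l in range(L, 0, -1):                 # backward pass
        grad_W += kappa * np.outer(lam, phi(hs[l - 1])) / np.sqrt(n)
        lam = lam + kappa * dphi(hs[l - 1]) * (W.T @ lam) / np.sqrt(n)
    return out, grad_W

rng = np.random.default_rng(0)
n, d, T = 32, 8, 1.0
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
U = rng.standard_normal((n, d))
W = rng.standard_normal((n, n))
v = rng.standard_normal(n)

# Fine discretisation used as a proxy for the continuous-depth model.
ref_out, ref_grad = resnet_output_and_grad_W(x, U, W, v, T, L=4096)

errors = {}
for L in (8, 16, 32, 64):
    out, grad = resnet_output_and_grad_W(x, U, W, v, T, L)
    errors[L] = (abs(out - ref_out), float(np.linalg.norm(grad - ref_grad)))
    print(f"L={L:3d}  output diff={errors[L][0]:.3e}  gradient diff={errors[L][1]:.3e}")
```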
Training Dynamics. Despite this mismatch in gradient computation, we did not observe significant differences in the training dynamics between Neural ODEs and ResNets. We trained Neural ODEs and their finite-depth ResNet approximations (fixed at depth 200, as further depth increases did not reduce errors, as shown in Figure 12(d)) on a subset of MNIST. As illustrated in Figure 13, both models exhibit similar training and test losses: the output differences remain consistently small during training, while the gradient differences oscillate. ResNets, as finite-depth networks, are known to exhibit global convergence guarantees under gradient descent in overparameterized regimes (Du et al., 2019a), so their convergence is unsurprising. What is unexpected, however, is the near-identical training dynamics between Neural ODEs and ResNets despite the gradient mismatch caused by ReLU's non-smoothness. Our hypothesis is that while gradient differences oscillate during training, they remain small in magnitude because MNIST is a simple dataset and ReLU's derivative is continuous everywhere except at the origin. This partial smoothness may mitigate the adverse effects of the gradient mismatch. However, we anticipate that in realistic applications involving more complex datasets, these differences could lead to divergent training trajectories and dynamics for Neural ODEs and ResNets using non-smooth activations.
In this subsection, we investigate the impact of different numerical ODE solvers on the accuracy of gradient computation and overall training dynamics in the "Optimize-then-discretize" framework. The solvers considered in our experiments are Euler, rk4, and dopri5.
As illustrated in Figure 14(a)-(b), the choice of ODE solver does not significantly affect the accuracy of gradient computation or the overall training dynamics in our specific setting. This is consistent with the theoretical guarantees established in Proposition 1, where we demonstrated that the ODE dynamics in Eq. (2) and Eq. (4) possess globally unique solutions under the smoothness conditions on activation functions. Given the relatively simple nature of the system studied, the numerical errors introduced by the solvers appear to be negligible in this context. However, this observation may not generalize to more complex systems or practical applications where numerical errors can be influenced by other factors such as stiffness or stability in the dynamics, which are beyond the scope of this paper.
An interesting observation from our experiments is the computational efficiency of the solvers. While adaptive solvers like dopri5 provide high accuracy, they require significantly more computation time as the neural network width increases. In contrast, fixed-step methods such as Euler and rk4 scale more efficiently with width, making them preferable in scenarios where computational cost is a concern. This is illustrated in Figure 14(c), where we compare the time taken by the solvers across different widths.
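For reference, the two fixed-step schemes can be written in a few lines. The sketch below integrates the forward dynamics at initialization with forward Euler and with the classical fourth-order Runge-Kutta scheme and compares both with a fine-grid solution. The adaptive dopri5 solver is not implemented here, and the sizes, step counts, and the helper name `integrate` are assumptions of this sketch rather than the experimental setup.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def integrate(h0, W, T, steps, method="euler"):
    """Fixed-step integration of dh/dt = W softplus(h) / sqrt(n) on [0, T]."""
    n = h0.shape[0]
    f = lambda h: W @ softplus(h) / np.sqrt(n)
    dt = T / steps
    h = h0.copy()
    for _ in range(steps):
        if method == "euler":
            h = h + dt * f(h)
        elif method == "rk4":
            k1 = f(h)
            k2 = f(h + 0.5 * dt * k1)
            k3 = f(h + 0.5 * dt * k2)
            k4 = f(h + dt * k3)
            h = h + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        else:
            raise ValueError(f"unknown method: {method}")
    return h

rng = np.random.default_rng(0)
n, T = 64, 1.0
h0 = rng.standard_normal(n) / np.sqrt(8.0)
W = rng.standard_normal((n, n))

reference = integrate(h0, W, T, steps=2000, method="rk4")   # fine grid as reference
err = {}
for method in ("euler", "rk4"):
    err[method] = float(np.linalg.norm(integrate(h0, W, T, steps=20, method=method) - reference))
    print(f"{method:5s} with 20 steps: error = {err[method]:.3e}")
```

Each RK4 step costs four evaluations of the vector field against one for Euler, so the per-step cost of both schemes grows in the same way with the width.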

Section: I DISCUSSION ON GENERAL DYNAMIC FORM IN NEURAL ODES
In this section, we discuss extending our results from the specific form equation 1 and equation 2 to a more general dynamic formulation. Specifically, we first consider a generalized nonlinear transformation:
$$\dot{h}_t = \frac{\sigma_w}{\sqrt{n}} W f(h_t, t), \quad \forall t \in [0, T],$$
where the original nonlinear activation function $\phi$ in equation 2 is replaced by a general nonlinear mapping $f : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$, defined as $f : (h, t) \mapsto f(h, t)$. This generalization introduces explicit time dependence, transforming the system from an autonomous to a non-autonomous system. Non-autonomous systems are prevalent in applications such as diffusion models (Song et al., 2020) and physics-informed neural networks (PINNs) (Sholokhov et al., 2023). The function $f$ can represent another shallow neural network or more complex operations, such as convolution layers (LeCun et al., 1998), gating mechanisms (Hochreiter, 1997), attention mechanisms (Vaswani, 2017), or batch normalization (Ioffe, 2015).
These advanced formulations have broadened the applicability of Neural ODEs to diverse domains, such as time-series modeling (Rubanova et al., 2019), computer vision (Chen et al., 2018b; Park et al., 2021), and reinforcement learning (Du et al., 2020). In generative modeling, Neural SDEs underpin approaches like FFJORD (Grathwohl et al., 2018), score-based methods (Song et al., 2020), and diffusion models (Ho et al., 2020). Similarly, in physics-informed machine learning, Neural PDEs and Physics-Informed Neural Networks (PINNs) have proven critical for solving physical systems while incorporating domain-specific knowledge (Sholokhov et al., 2023; Karniadakis et al., 2021; Raissi et al., 2019). However, while these features offer flexibility and efficiency, they also introduce significant challenges during training.
A key challenge in training Neural ODEs lies in gradient computation. The original adjoint method introduced by Chen et al. (2018b) computes gradients with minimal memory overhead. However, this approach can suffer from numerical instabilities, as observed in Gholaminejad et al. (2019). To address these issues, advanced methods have been developed. For instance, Zhuang et al. (2020a) integrates adjoint techniques with checkpointing to balance memory usage and computational cost, while Matsubara et al. (2021) employs symplectic integrators to preserve ODE structure, ensuring stability in long-time horizons and oscillatory systems. Finlay et al. (2020) regularizes the Jacobian norm of the dynamics to improve stability and generalization. Ko et al. (2023) introduces a homotopy-based approach, starting with simplified dynamics and gradually transitioning to target dynamics. These methods generally follow an "optimize-then-discretize" approach, where (augmented) backward ODEs are solved numerically to compute gradients. Conversely, the "discretize-then-optimize" approach, which discretizes the forward ODE into a finite-depth network for gradient computation via backpropagation, has been explored by Massaroli et al. (2020). However, as noted in Zhuang et al. (2020a;b), this method often results in deeper computational graphs, raising concerns about gradient accuracy.
To address the challenge in gradient computation, several theoretical studies have been conducted, focusing on well-posedness and stability. For instance, Gholaminejad et al. (2019) highlighted significant numerical instabilities when using ReLU activations in Neural ODEs. Meanwhile, Rodriguez et al. (2022) investigated the stability of Neural ODEs through a Lyapunov framework derived from control theory. Despite these advancements, none of these works address when and how the "discretize-then-optimize" and "optimize-then-discretize" methods can yield equivalent gradients. Moreover, the question of whether simple first-order optimization methods, such as stochastic gradient descent, can reliably train Neural ODEs to convergence remains unexplored.
Another essential challenge lies in analyzing the training dynamics of Neural ODEs due to the inherent nonconvexity of neural network optimization. A significant breakthrough in this area came from the Neural Tangent Kernel (NTK) framework introduced by Jacot et al. (2018), which demonstrated that the NTK governs the training dynamics of feedforward networks (FFNs) under gradient descent and converges to a deterministic limit as network width increases. This convergence facilitates global convergence guarantees for gradient-based optimization in overparameterized regimes, provided the NTK remains strictly positive definite (SPD) (Du et al., 2019a; Allen-Zhu et al., 2019; Nguyen, 2021). The strict positive definiteness of the NTK has been extensively studied, beginning with dual activation analysis for two-layer networks (Daniely et al., 2016) and later extended to finite-depth FFNs (Jacot et al., 2018; Du et al., 2019a). Recent work has further applied NTK theory to diverse architectures, including convolutional neural networks (CNNs) (Arora et al., 2019), recurrent neural networks (RNNs) (Yang, 2020), transformers (Hron et al., 2020), physics-informed neural networks (PINNs) (Wang et al., 2022), and graph neural networks (GNNs) (Du et al., 2019b). NTK analysis has also been explored for various optimization methods, such as stochastic gradient descent (SGD) (Zou et al., 2020) and adaptive gradient algorithms (Chen et al., 2018a). A few recent works have begun studying large-depth neural networks (Gao & Gao, 2022a;b; Gao, 2024). However, applying NTK theory to continuous-depth models like Neural ODEs and determining whether similar SPD and convergence properties hold remains an open and active area of research.
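Despite significant advancements, challenges persist in understanding the training dynamics of Neural ODEs and ensuring gradient consistency between the "discretize-then-optimize" and "optimize-then-discretize" approaches. Our work addresses these gaps by:
1. Gradient Equivalence: Establishing conditions under which the gradients computed by the two methods are equivalent, as demonstrated in Proposition 1 and Proposition 2, emphasizing the role of smooth activations.
2. Providing rigorous conditions for the well-definedness of the Neural ODE NTK, demonstrating its strict positive definiteness (SPD) under suitable activation function properties.
3. Global Convergence: Extending global convergence guarantees for gradient descent in overparameterized Neural ODEs, bridging the gap between discrete and continuous-depth models.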

Published as a conference paper at ICLR 2025
For this generalized form, the backward dynamics take the form:
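A form consistent with the generalized forward dynamics is obtained by the chain rule, assuming the adjoint state is defined as $\lambda_t := \partial f_\theta(x)/\partial h_t$ as for the original backward ODE. The equation below is a reconstruction under that assumption, not a verbatim statement:

```latex
\dot{\lambda}_t = -\frac{\sigma_w}{\sqrt{n}}\, J(h_t, t)^\top W^\top \lambda_t,
\qquad \forall t \in [0, T],
```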
where J = ∂f /∂h ∈ R n×n is the Jacobian matrix of f with respect to h. By Theorem 5, the forward ODE has unique global solutions if f is continuous in t and Lipschitz continuous in h, with a Lipschitz constant independent of t. This generalizes the continuity requirement for the activation function ϕ to f . Additionally, if the Jacobian matrix J is globally bounded, the backward ODE also admits a unique global solution. Since f is Lipschitz continuous in h, the boundedness of J is naturally satisfied (on a compact set). Therefore, appropriate smoothness conditions on f ensure well-posed forward and backward dynamics with unique solutions.
Using Euler's method, we discretize the forward and backward dynamics as follows:
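One natural discretization, written here as an assumption, is the forward Euler step for the state together with the backward recursion obtained by backpropagating through that step. The grid $t_\ell = \ell\kappa$ and the point at which $J$ is evaluated are choices of this sketch:

```latex
h^{\ell} = h^{\ell-1} + \kappa\,\frac{\sigma_w}{\sqrt{n}}\, W f(h^{\ell-1}, t_{\ell-1}),
\qquad
\lambda^{\ell-1} = \lambda^{\ell} + \kappa\,\frac{\sigma_w}{\sqrt{n}}\, J(h^{\ell-1}, t_{\ell-1})^\top W^\top \lambda^{\ell},
\qquad \ell \in \{1, 2, \cdots, L\},
```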
where κ = T /L. Ensuring convergence of (h ℓ , λ ℓ ) to (h t , λ t ) is critical to aligning the gradients obtained from the "discretize-then-optimize" and "optimize-then-discretize" methods. As discussed in Proposition 2, additional smoothness of the backward ODE is required for gradient equivalence. By Theorem 7, the mapping t → J (h t , t) must be continuous in t, which implies that J is Lipschitz continuous in h with a Lipschitz constant independent of t. The smoothness of J with respect to h can be guaranteed by imposing second-order regularity conditions on f . Specifically, bounding the Jacobian tensor ∂J /∂h under suitable norms, such as the operator norm or Frobenius norm, ensures the required regularity. Although ∂J /∂h represents a higher-order tensor, these regularity conditions allow the gradient consistency results from Proposition 2 to extend seamlessly to this generalized formulation.
Theorem 7 provides not only convergence guarantees but also a uniform convergence rate under globally uniform smoothness conditions. Consequently, by Theorem 8, the iterated limits in Lemma 1 and Lemma 2 converge to the same double limit. As a result, the NNGP and NTK of the generalized Neural ODE remain well-defined. If the limiting NNGP or NTK is strictly positive definite (SPD), global convergence under gradient descent can also be established.
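Why uniformity matters for exchanging the width and depth limits can be seen on a toy double sequence. The two sequences below are hypothetical examples chosen for illustration and are not quantities from the paper: `a` converges pointwise but not uniformly, so its iterated limits disagree, while `b` converges uniformly and both orders give the same value.

```python
def a(n: int, l: int) -> float:
    """Toy sequence n / (n + l): the limit in n is 1 for each l, the limit in l is 0 for each n."""
    return n / (n + l)

def b(n: int, l: int) -> float:
    """Toy sequence 1/n + 1/l: converges uniformly, so the iterated limits agree (both are 0)."""
    return 1.0 / n + 1.0 / l

moderate, huge = 10 ** 3, 10 ** 12

# Send one index to (numerical) infinity first, keeping the other moderate.
print("a, n first:", a(huge, moderate), " a, l first:", a(moderate, huge))
print("b, n first:", b(huge, moderate), " b, l first:", b(moderate, huge))
```

The first pair of values stays near 1 and 0 respectively no matter how large the moderate index is made, whereas both values for `b` shrink to 0 together.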
Finally, we discuss extending the dynamics to a post-activation formulation:
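Based on the description of this formulation, in which the linear map acts on the state before the nonlinearity, a consistent way to write it is the following (a reconstruction under that reading, not a verbatim statement):

```latex
\dot{h}_t = f\!\left(\frac{\sigma_w}{\sqrt{n}}\, W h_t,\; t\right),
\qquad \forall t \in [0, T],
```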
where the linear transformation $h \mapsto \frac{\sigma_w}{\sqrt{n}} W h$ is applied before the nonlinear mapping $f$. The analysis remains analogous because the linear transformation is globally 1-Lipschitz continuous under Theorem 4. However, we focus primarily on the pre-activation form, as it consistently achieves superior empirical performance compared to the post-activation formulation (He et al., 2016b).

Section: J RELATED WORKS
Neural Ordinary Differential Equations (Neural ODEs) (Chen et al., 2018b) introduced a continuous-depth framework for modeling dynamics by replacing discrete-layer transformations with parameterized differential equations. This innovative framework has since inspired extensive research, leading to both theoretical advancements and practical applications.
Neural ODEs are distinguished by their continuous-time representation and memory efficiency through parameter sharing, setting them apart from traditional architectures like ResNet (He et al., 2016a). Building on this foundation, several extensions have been proposed to address more complex systems. Notable examples include Neural Stochastic Differential Equations (SDEs) for stochastic dynamics (Li et al., 2020), Neural Partial Differential Equations (PDEs) for spatiotemporal systems (Sirignano & Spiliopoulos, 2018; Raissi et al., 2019), Neural Controlled Differential Equations (CDEs) for irregular time-series data (Kidger et al., 2020), and Neural Variational and Hamiltonian Systems for capturing conserved quantities in physical dynamics (Greydanus et al., 2019).


References:
[b0] Zeyuan Allen-Zhu; Yuanzhi Li; Zhao Song (2019). A convergence theory for deep learning via overparameterization. PMLR
[b1] Sanjeev Arora; Simon S Du; Wei Hu; Zhiyuan Li; Ruslan Salakhutdinov; Ruosong Wang (2019). On exact computation with an infinitely wide neural net. 
[b2] Zhi-Dong Bai; Yong-Qua Yin (2008). Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. World Scientific
[b3] Jinghui Chen; Dongruo Zhou; Yiqi Tang; Ziyan Yang; Yuan Cao; Quanquan Gu (2018). Closing the generalization gap of adaptive gradient methods in training deep neural networks. 
[b4] Ricky T. Q. Chen; Yulia Rubanova; Jesse Bettencourt; David K. Duvenaud (2018). Neural ordinary differential equations. 
[b5] Amit Daniely; Roy Frostig; Yoram Singer (2016). Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. 
[b6] Jianzhun Du; Joseph Futoma; Finale Doshi-Velez (2020). Model-based reinforcement learning for semi-markov decision processes with neural odes. Advances in Neural Information Processing Systems
[b7] Simon Du; Jason Lee; Haochuan Li; Liwei Wang; Xiyu Zhai (2019). Gradient descent finds global minima of deep neural networks. PMLR
[b8] Simon S. Du; Kangcheng Hou; Russ R. Salakhutdinov; Barnabas Poczos; Ruosong Wang; Keyulu Xu (2019). Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in neural information processing systems
[b9] Chris Finlay; Jörn-Henrik Jacobsen; Levon Nurbekyan; Adam Oberman (2020). How to train your neural ode: the world of jacobian and kinetic regularization. PMLR
[b10] Tianxiang Gao (2024). Mastering infinite depths: Optimization and generalization in deeper neural networks. 
[b11] Tianxiang Gao; Hongyang Gao (2022). Gradient descent optimizes infinite-depth relu implicit networks with linear widths. 
[b12] Tianxiang Gao; Hongyang Gao (2022). On the optimization and generalization of overparameterized implicit neural networks. 
[b13] Tianxiang Gao; Hailiang Liu; Jia Liu; Hridesh Rajan; Hongyang Gao (2021). A global convergence theory for deep relu implicit networks via over-parameterization. 
[b14] Tianxiang Gao; Xiaokai Huo; Hailiang Liu; Hongyang Gao (2023). Wide neural networks as gaussian processes: Lessons from deep equilibrium models. Advances in Neural Information Processing Systems
[b15] Amir Gholami; Kurt Keutzer; George Biros (2019). Anode: unconditionally accurate memory-efficient gradients for neural odes. 
[b16] Amir Gholaminejad; Kurt Keutzer; George Biros (2019). Anode: Unconditionally accurate memory-efficient gradients for neural odes. 
[b17] Xavier Glorot; Yoshua Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. 
[b18] Tilmann Gneiting (2013). Strictly and non-strictly positive definite functions on spheres. Bernoulli
[b19] Will Grathwohl; Ricky T. Q. Chen; Jesse Bettencourt; Ilya Sutskever; David Duvenaud (2018). Ffjord: Free-form continuous dynamics for scalable reversible generative models. 
[b20] Samuel Greydanus; Misko Dzamba; Jason Yosinski (2019). Hamiltonian neural networks. Advances in neural information processing systems
[b21] Soufiane Hayou; Greg Yang (2023). Width and depth limits commute in residual networks. PMLR
[b22] Kaiming He; Xiangyu Zhang; Shaoqing Ren; Jian Sun (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. 
[b23] Kaiming He; Xiangyu Zhang; Shaoqing Ren; Jian Sun (2016). Deep residual learning for image recognition. 
[b24] Kaiming He; Xiangyu Zhang; Shaoqing Ren; Jian Sun (2016). Identity mappings in deep residual networks. Springer
[b25] Jonathan Ho; Ajay Jain; Pieter Abbeel (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems
[b26] Sepp Hochreiter (1997). Long short-term memory. Neural Computation
[b27] Jiri Hron; Yasaman Bahri; Jascha Sohl-Dickstein; Roman Novak (2020). Infinite attention: Nngp and ntk for deep attention networks. PMLR
[b28] Sergey Ioffe (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. 
[b29] Arthur Jacot; Franck Gabriel; Clément Hongler (2018). Neural tangent kernel: Convergence and generalization in neural networks. 
[b30] George Em Karniadakis; Ioannis G. Kevrekidis; Lu Lu; Paris Perdikaris; Sifan Wang; Liu Yang (2021). Physics-informed machine learning. Nature Reviews Physics
[b31] Patrick Kidger; James Morrill; James Foster; Terry Lyons (2020). Neural controlled differential equations for irregular time series. Advances in Neural Information Processing Systems
[b32] Joon-Hyuk Ko; Hankyul Koh; Nojun Park; Wonho Jhe (2023). Homotopy-based training of neuralodes for accurate dynamics discovery. Advances in Neural Information Processing Systems
[b33] Yann Lecun; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). Gradient-based learning applied to document recognition. 
[b34] Jaehoon Lee; Jascha Sohl-Dickstein; Jeffrey Pennington; Roman Novak; Sam Schoenholz; Yasaman Bahri (2018). Deep neural networks as gaussian processes. 
[b35] Xuechen Li; Ting-Kam Leonard Wong; Ricky T. Q. Chen; David Duvenaud (2020). Scalable gradients for stochastic differential equations. PMLR
[b36] Ge Yang; Samuel Schoenholz (2017). Mean field residual networks: On the edge of chaos. Advances in neural information processing systems
[b37] Greg Yang (2019). Wide feedforward or recurrent neural networks of any architecture are gaussian processes. 
[b38] Greg Yang (2020). Tensor programs ii: Neural tangent kernel for any architecture. 
[b39] Greg Yang; Dingli Yu; Chen Zhu; Soufiane Hayou (2024). Tensor programs VI: Feature learning in infinite depth neural networks. 
[b40] Juntang Zhuang; Nicha Dvornek; Xiaoxiao Li; Sekhar Tatikonda; Xenophon Papademetris; James Duncan (2020). Adaptive checkpoint adjoint method for gradient estimation in neural ode. PMLR
[b42] Juntang Zhuang; Nicha C. Dvornek; James S. Duncan (2020). Mali: A memory efficient and reverse accurate integrator for neural odes. 
[b43] Difan Zou; Yuan Cao; Dongruo Zhou; Quanquan Gu (2020). Gradient descent optimizes overparameterized deep relu networks. Machine learning
[b44]  (). Providing rigorous conditions for the well-definedness of the Neural ODE NTK, demonstrating its strict positive definiteness (SPD) under suitable activation function properties. 
[b45]  (). Global Convergence: Extending global convergence guarantees for gradient descent in overparameterized Neural ODEs, bridging the gap between discrete and continuous-depth models. 

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: Analysis of Neural ODE output, gradient differences, and NTK convergence. (a) Output differences between Neural ODE and finite-depth ResNet across different widths using Softplus activation. (b) Gradient differences for Neural ODE and ResNet models under Softplus activation. (c) NTK convergence behavior across different widths, showing the NTK approximation converging to the limiting NTK as width increases. (d) NTK convergence behavior on a log-log scale, further emphasizing the rapid convergence at larger widths.
Data: 

Figure fig_1: 2
Type: figure
Caption: Figure 2: Empirical results of Neural ODEs with varying widths: (a) NTK smallest eigenvalue grows and stabilizes as the width increases, with negative values for widths below the training size. (b) Parameter distances stay stable and bounded within O(1). (c) Linear-scale train and test losses show faster convergence for larger widths. (d) Log-scale losses further confirm improved generalization for wider models.
Data: 

Figure fig_2: 8
Type: figure
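Caption: Theorem 8 (Moore-Osgood Theorem). If $\lim_{n\to\infty} a_{n,m} = b_m$ uniformly in $m$, and $\lim_{m\to\infty} a_{n,m} = c_n$ for each $n$, then both $\lim_{m\to\infty} b_m$ and $\lim_{n\to\infty} c_n$ exist and are equal to the double limit, i.e., $\lim_{m\to\infty} b_m = \lim_{n\to\infty} c_n = \lim_{n,m\to\infty} a_{n,m}$.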
Data: 

Figure fig_3: 3
Type: figure
Caption: Figure 3: Effects of increasing time horizons on Neural ODE outputs and training. (a) Output variance increases as the time horizon T becomes large at initialization. (b) This leads to damping during the early stages of training with gradient descent. (c) Scaling the dynamics by setting the weight variance $\sigma_w \sim 1/T$ reduces the output variance. (d) This scaling also mitigates early-stage damping, improving training stability.
Data: 

Figure fig_5: 4
Type: figure
Caption: Figure 4: Gaussian fit of the sample distribution from 10,000 randomly initialized Neural ODEs across widths 10, 50, 100, 200, 500, and 1000. The corresponding KS statistics and p-values are displayed, showing improved Gaussian fit as width increases.
Data: 

Figure fig_6: 5
Type: figure
Caption: Figure 5: Pairplots of output neurons given the same input data, showing that output neurons become independent as network width increases.
Data: 

Figure fig_7: 678
Type: figure
Caption: Figure 6: Comparison of input and output covariance matrices. (a) Input covariance matrix. (b) Output covariance matrix from Neural ODEs, showing similar structure but reduced variance. (c) Least eigenvalues of the covariance matrices, confirming positive definiteness as width increases.
Data: 

Figure fig_8: 
Type: figure
Caption: (d). Despite this, the quadratic Neural ODE converged much more slowly than Softplus, as illustrated in the training and test losses (Figure 8(a)-(b)).
Data: 

Figure fig_9: 9
Type: figure
Caption: Figure 9: Training and test loss behavior for CIFAR-10 across different activations: (a) Softplus, (b) ReLU, and (c) GELU. All activations show similar convergence patterns, with larger widths leading to faster convergence.
Data: 

Figure fig_10: 101213
Type: figure
Caption: Figure 10: Training and test loss behavior for AG News across different activations: (a) Softplus, (b) ReLU, and (c) GELU. Softplus converges fastest, while GELU lags due to its derivative behavior.
Data: 

Figure fig_11: 14
Type: figure
Caption: Figure 14: Sensitivity of "Optimize-then-discretize" to ODE solvers. (a) Training and test losses decrease consistently for all three solvers at width 500. (b) Training and test losses decrease consistently for all three solvers at width 2000. (c) Time taken by each ODE solver across different widths, highlighting the scalability advantage of fixed-step solvers.
Data: 

Figure tab_0: 
Type: table
Caption: , does not occur here. This stability results from a combination of several factors, including skip connections, scaling κ, and smoothness and nonlinearity of ϕ. With stable information propagation in Neural ODEs, we can use the nonlinearity of ϕ to show that the NNGP kernel Σ * and the limiting NTK K ∞ are SPD. Proposition 5. If ϕ is Lipschitz, nonlinear but non-polynomial, then the NNGP kernel Σ * is SPD. Corollary 1. Suppose ϕ and ϕ ′ are Lipschitz continuous. If ϕ is nonlinear but non-polynomial, then the limiting NTK K ∞ is SPD.
Data: With these results, we can establish the global convergence of Neural ODEs under gradient descent with appropriate assumptions about the activation function $\phi$ and the training data. Assumption 1. Let $\{x_i, y_i\}_{i=1}^N$ be a training set. Assume 1. Training set:

Figure tab_1: 
Type: table
Caption: Stefano Massaroli, Michael Poli, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. Dissecting neural odes. Advances in Neural Information Processing Systems, 33:3952-3963, 2020.
Takashi Matsubara, Yuto Miyatake, and Takaharu Yaguchi. Symplectic adjoint method for exact gradient of neural ode with minimal memory. Advances in Neural Information Processing Systems, 34:20772-20784, 2021.
Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
Quynh Nguyen. On the proof of global convergence of gradient descent for deep relu networks with linear widths. In International Conference on Machine Learning, pp. 8056-8062. PMLR, 2021.
Derek Onken, Samy Wu Fung, Xingjian Li, and Lars Ruthotto. Ot-flow: Fast and accurate continuous normalizing flows via optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9223-9232, 2021.
Katharina Ott, Prateek Katiyar, Philipp Hennig, and Michael Tiemann. Resnet after all: Neural odes and their numerical solution. In International Conference on Learning Representations, 2020.
Data: 

Figure tab_3: 
Type: table
Caption: we have $h^\ell \to h_{t_\ell}$ and $\lambda^\ell \to \lambda_{t_\ell}$. Using the convergence analysis of Euler's method (equation 7), we obtain a convergence rate showing that this convergence is uniform in the width $n$, provided the activation function is smooth. This result is fundamental: it ensures that the NTK of the Neural ODE is well defined and allows us to study the training dynamics of Neural ODEs under gradient-based methods. Lemma 10. If $\phi$ and $\phi'$ are $L_1$- and $L_2$-Lipschitz continuous, then the following inequalities hold for every $x \in S^{d-1}$ a.s.:
Data: 


Formulas:
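Formula formula_0: $$f(x;\theta) = \frac{\sigma_v}{\sqrt{n}}\, v^\top \phi(h_T), \tag{1}$$

Formula formula_1: $$h_0 = \sigma_u U x/\sqrt{d}, \quad\text{and}\quad \dot h_t = \sigma_w W \phi(h_t)/\sqrt{n}, \quad \forall t\in[0,T], \tag{2}$$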

Formula formula_2: U ij , W ij , v i i.i.d.

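Formula formula_3: $\lambda_T = \sigma_v \operatorname{diag}(\phi'(h_T))\,v/\sqrt{n}$

Formula formula_4: $$\nabla_v f(x;\theta) = \frac{\sigma_v}{\sqrt n}\phi(h_T), \quad \nabla_W f(x;\theta) = \int_0^T \frac{\sigma_w}{\sqrt n}\lambda_t\phi(h_t)^\top dt, \quad \nabla_U f(x;\theta) = \frac{\sigma_u}{\sqrt d}\lambda_0 x^\top. \tag{5}$$

Formula formula_6: $$L(\theta) = \sum_{i=1}^N \frac12\big(f(x_i;\theta) - y_i\big)^2 = \frac12\|u - y\|^2, \tag{6}$$

Formula formula_7: $$u_{k+1} - y \approx (I - \eta H_k)(u_k - y), \tag{8}$$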
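The residual recursion (8) can be illustrated on a toy problem. The sketch below is not the Neural ODE itself: it assumes a model that is linear in its parameters, so the Jacobian `J` is constant, the Gram matrix `H = J J^T` does not change with `k`, and (8) holds with equality. All names (`J`, `H`, `eta`) are illustrative choices, not notation fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 50                                   # samples, parameters (overparameterised: p > N)
J = rng.standard_normal((N, p)) / np.sqrt(p)   # constant Jacobian of a linear-in-parameters model
y = rng.standard_normal(N)
theta = rng.standard_normal(p)

H = J @ J.T                                    # Gram matrix, plays the role of H_k
eigs = np.linalg.eigvalsh(H)                   # ascending eigenvalues
lam_min, eta = eigs[0], 1.0 / eigs[-1]         # smallest eigenvalue and a stable step size

residual0 = J @ theta - y                      # u_0 - y
predicted = residual0.copy()
steps = 20
for _ in range(steps):
    theta = theta - eta * J.T @ (J @ theta - y)        # gradient descent on 0.5 * ||u - y||^2
    predicted = (np.eye(N) - eta * H) @ predicted      # recursion (8), exact in this toy case

actual = J @ theta - y                         # u_k - y after `steps` updates
```

In this setting the residual contracts at least by the factor `1 - eta * lam_min` per step, which is the mechanism behind the linear rate once the smallest eigenvalue of the Gram matrix is bounded away from zero.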

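Formula formula_8: $$K(x,\bar x;\theta) := \langle \nabla_\theta f(x;\theta), \nabla_\theta f(\bar x;\theta)\rangle. \tag{9}$$

Formula formula_9: $$f^L(x;\theta) = \frac{\sigma_v}{\sqrt n}\, v^\top \phi(h^L(x)), \tag{10a}$$

Formula formula_10: $$h^\ell = h^{\ell-1} + \kappa\cdot\frac{\sigma_w}{\sqrt n} W\phi(h^{\ell-1}), \quad \forall \ell\in\{1,2,\dots,L\}, \tag{10b}$$

Formula formula_11: $$h^0 = \frac{\sigma_u}{\sqrt d} U x, \tag{10c}$$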
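The finite-depth network (10a)-(10c) is the forward-Euler discretization of the Neural ODE. A minimal NumPy sketch follows, assuming the step size is $\kappa = T/L$ and using `tanh` as a stand-in smooth activation; the function name `resnet_forward` and its defaults are hypothetical.

```python
import numpy as np

def resnet_forward(x, U, W, v, L, T=1.0, sigma_u=1.0, sigma_w=1.0, sigma_v=1.0, phi=np.tanh):
    """Evaluate (10a)-(10c) with step size kappa = T / L (an assumption of this sketch)."""
    n, d = U.shape
    kappa = T / L
    h = sigma_u / np.sqrt(d) * (U @ x)                            # (10c): input layer
    for _ in range(L):
        h = h + kappa * sigma_w / np.sqrt(n) * (W @ phi(h))       # (10b): shared-weight residual step
    out = sigma_v / np.sqrt(n) * (v @ phi(h))                     # (10a): scalar read-out
    return out, h

# Usage: the same weights are reused at every depth, so refining L only refines the time grid.
rng = np.random.default_rng(0)
n, d = 64, 5
U, W, v = rng.standard_normal((n, d)), rng.standard_normal((n, n)), rng.standard_normal(n)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                                            # inputs live on the unit sphere
outputs = {L: resnet_forward(x, U, W, v, L) for L in (10, 20, 100, 200)}
```

Because the weights are shared across layers, increasing `L` changes only the discretization, so successive refinements should move the hidden state by an amount that shrinks roughly like $1/L$.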

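Formula formula_12: $$\|\nabla_\theta f^L(x) - \nabla_\theta f(x)\| \le C L^{-1}, \quad \forall \ell\in\{0,1,\dots,L\}, \tag{11}$$

Formula formula_13: $$C^{0,k}(x,\bar x) = \delta_{0,k}\frac{\sigma_u^2}{d}\, x^\top \bar x, \quad \forall k\in\{0,1,\dots,L+1\}, \tag{12}$$ $$C^{\ell,k}(x,\bar x) = \sigma_w^2\,\mathbb{E}\,\phi(u^{\ell-1})\phi(\bar u^{k-1}), \quad \forall \ell,k\in\{1,2,\dots,L+1\}. \tag{13}$$

Formula formula_15: $$\mathbb{E}(u^\ell \bar u^k) = C^{0,0}(x,\bar x) + \kappa^2\sum_{i,j=1}^{\ell,k} C^{i,j}(x,\bar x), \quad \forall \ell,k\in\{0,1,\dots,L\}. \tag{14}$$

Formula formula_16: $\lim_{\ell\to\infty}\big(\lim_{n\to\infty} a_{n,\ell}\big) = 1$, while $\lim_{n\to\infty}\big(\lim_{\ell\to\infty} a_{n,\ell}\big) = 0$.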
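A standard toy double sequence with exactly this behaviour, chosen purely for illustration and not taken from the paper, is $a_{n,\ell} = n/(n+\ell)$. It shows why the order of the width and depth limits cannot be exchanged without a uniform convergence argument:

```latex
a_{n,\ell} = \frac{n}{n+\ell}: \qquad
\lim_{n\to\infty} a_{n,\ell} = 1 \ \text{for each fixed } \ell
\;\Longrightarrow\;
\lim_{\ell\to\infty}\Big(\lim_{n\to\infty} a_{n,\ell}\Big) = 1,
\qquad
\lim_{\ell\to\infty} a_{n,\ell} = 0 \ \text{for each fixed } n
\;\Longrightarrow\;
\lim_{n\to\infty}\Big(\lim_{\ell\to\infty} a_{n,\ell}\Big) = 0 .
```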

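Formula formula_17: $\langle\phi(h^L),\phi(\bar h^L)\rangle/n$ satisfies $$\big|\langle\phi(h^L),\phi(\bar h^L)\rangle - \langle\phi(h_T),\phi(\bar h_T)\rangle\big|/n \le C L^{-1}, \tag{15}$$

Formula formula_19: $\Sigma^L$ converges to $\Sigma^*$ with a rate of $|\Sigma^L(x,\bar x) - \Sigma^*(x,\bar x)| \sim C L^{-1}$

Formula formula_20: $$K^L(x,\bar x;\theta) := \langle\nabla_\theta f^L(x;\theta), \nabla_\theta f^L(\bar x;\theta)\rangle. \tag{16}$$

Formula formula_21: $$K^L_\infty(x,\bar x) = C^{L+1,L+1}(x,\bar x) + \kappa^2\sum_{\ell,k=1}^{L} C^{\ell,k}(x,\bar x)\,D^{\ell,k}(x,\bar x) + C^{0,0}(x,\bar x)\,D^{0,0}(x,\bar x), \tag{17}$$

Formula formula_23: $$D^{L,k}(x,\bar x) = \sigma_w^2\,\mathbb{E}\,\phi'(u^L)\phi'(\bar u^L), \quad \forall k\in\{0,1,\dots,L\}, \tag{18}$$

Formula formula_24: $$D^{\ell,k}(x,\bar x) = \kappa^2\sum_{i,j=L}^{\ell+1,k+1} D^{i,j}(x,\bar x)\,\mathbb{E}[\phi'(u^i)\phi'(\bar u^j)], \quad \forall \ell,k\in\{1,2,\dots,L-1\}. \tag{19}$$

Formula formula_26: $$\big|K^L_\theta(x,\bar x) - K_\theta(x,\bar x)\big| \le C L^{-1}, \quad \forall x,\bar x\in S^{d-1}, \tag{20}$$

Formula formula_27: $$K_\theta \to K_\infty, \quad\text{as } n\to\infty, \tag{21}$$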

Formula formula_28: $\Sigma^*(x,\bar x) = \mathbb{E}[\phi(u)\phi(\bar u)],$

Formula formula_29: $S^*(x,\bar x) = \lim_{L\to\infty}\Big[C^{0,0}(x,\bar x) + \kappa^2\sum_{\ell,k=1}^{L} C^{\ell,k}(x,\bar x)\Big]$

Formula formula_30: $x_i\in S^{d-1}$ and $x_i\neq x_j$ for all $i\neq j$; $|y_i| = O(1)$,

Formula formula_31: $$\|\theta_k - \theta_0\| \le C\|X\|\sqrt{L(\theta_0)}/\lambda_0. \tag{22}$$

Formula formula_33: $$L(\theta_k) \le \left(1 - \frac{\eta\lambda_0}{16}\right)^k L(\theta_0), \tag{23}$$

Formula formula_34: $$s_{\min}(A) = \sqrt N - \sqrt n + o(\sqrt n), \quad s_{\max}(A) = \sqrt N + \sqrt n + o(\sqrt n), \quad\text{almost surely.} \tag{24}$$ Theorem 5 (Picard-Lindelöf theorem). Let $f:[a,b]\times\mathbb{R}^n\to\mathbb{R}^n$ be a function. If $f$ is continuous

Formula formula_35: $$\dot x(t) = f(t,x(t)), \tag{25}$$

Formula formula_36: $$\dot x = f(x,t), \quad t\in[t_0,t_1], \quad\text{and}\quad x(0) = x_0. \tag{26}$$

Formula formula_38: $$\|x(t_n) - x_n\| \le \frac{hM}{2L}\left(e^{L(t_n-t_0)} - 1\right), \tag{27}$$
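The global error bound (27) for Euler's method can be checked numerically on a scalar test problem. The sketch below assumes the usual reading of the constants, with $L$ the Lipschitz constant of $f$ in $x$ and $M$ a bound on $\|\ddot x\|$, and uses $\dot x = x$, $x(0)=1$ on $[0,1]$, for which $L = 1$ and $M = e$. The helper names `euler` and `euler_bound` are illustrative.

```python
import math

def euler(f, x0, t0, t1, steps):
    """Forward Euler for x' = f(x, t) with a fixed step h = (t1 - t0) / steps."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = x + h * f(x, t)
        t += h
    return x

def euler_bound(h, M, L, elapsed):
    """Right-hand side of (27): h * M / (2 L) * (exp(L * elapsed) - 1)."""
    return h * M / (2 * L) * (math.exp(L * elapsed) - 1)

f = lambda x, t: x                       # exact solution is exp(t)
errors, bounds = {}, {}
for steps in (10, 100, 1000):
    errors[steps] = abs(math.e - euler(f, 1.0, 0.0, 1.0, steps))
    bounds[steps] = euler_bound(1.0 / steps, M=math.e, L=1.0, elapsed=1.0)
```

Both the observed error and the bound scale linearly in the step size, which is the $O(L^{-1})$ rate (in the depth $L$) used for the finite-depth approximation.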

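Formula formula_40: $$u(t) \le \alpha(t) + \int_0^t \beta(s)u(s)\,ds, \quad \forall t\in I. \tag{28}$$

Formula formula_42: $$u(t) \le \alpha(t) + \int_0^t \alpha(s)\beta(s)\exp\left(\int_s^t\beta(r)\,dr\right)ds, \quad \forall t\in I. \tag{29}$$

Formula formula_43: $$u(t) \le \alpha(t)\exp\left(\int_0^t\beta(s)\,ds\right), \quad \forall t\in I.$$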

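Formula formula_45: $$\dot h_t = \frac{\sigma_w}{\sqrt n}W\phi(h_t), \quad h_0 = \frac{\sigma_u}{\sqrt d}Ux.$$

Formula formula_46: $$\mathcal{L}(h,\theta,\lambda,\mu) = f_\theta(x) + \int_0^T \lambda_t^\top\left[\frac{\sigma_w}{\sqrt n}W\phi(h) - \dot h\right]dt + \mu^\top\left[\frac{\sigma_u}{\sqrt d}Ux - h(0)\right]$$

Formula formula_47: $\mathcal{L}(h,\theta,\lambda,\mu) = f_\theta(x), \quad \forall(\lambda,\mu).$

Formula formula_48: $$\begin{aligned} \delta\mathcal{L}(h,\theta,\lambda,\mu) &= \frac{\sigma_v}{\sqrt n}(\delta v)^\top\phi(h(T)) + \frac{\sigma_v}{\sqrt n}v^\top\operatorname{diag}(\phi'(h(T)))\,\delta h(T) + \mu^\top\left[\frac{\sigma_u}{\sqrt d}(\delta U)x - \delta h(0)\right] \\ &\quad + \int_0^T\lambda^\top\left[\frac{\sigma_w}{\sqrt n}(\delta W)\phi(h) + \frac{\sigma_w}{\sqrt n}W\operatorname{diag}(\phi'(h))\,\delta h - \delta\dot h\right]dt \\ &= \frac{\sigma_v}{\sqrt n}(\delta v)^\top\phi(h(T)) + \frac{\sigma_v}{\sqrt n}v^\top\operatorname{diag}(\phi'(h(T)))\,\delta h(T) + \mu^\top\left[\frac{\sigma_u}{\sqrt d}(\delta U)x - \delta h(0)\right] \\ &\quad - \lambda^\top\delta h\Big|_0^T + \int_0^T\dot\lambda^\top\delta h\,dt + \int_0^T\lambda^\top\left[\frac{\sigma_w}{\sqrt n}(\delta W)\phi(h) + \frac{\sigma_w}{\sqrt n}W\operatorname{diag}(\phi'(h))\,\delta h\right]dt \\ &= \frac{\sigma_v}{\sqrt n}(\delta v)^\top\phi(h(T)) + \left[\frac{\sigma_v}{\sqrt n}v^\top\operatorname{diag}(\phi'(h(T))) - \lambda(T)^\top\right]\delta h(T) + \mu^\top\frac{\sigma_u}{\sqrt d}(\delta U)x + (\lambda(0)-\mu)^\top\delta h(0) \\ &\quad + \int_0^T\left[\dot\lambda^\top + \frac{\sigma_w}{\sqrt n}\lambda^\top W\operatorname{diag}(\phi'(h))\right]\delta h\,dt + \int_0^T\frac{\sigma_w}{\sqrt n}\lambda^\top(\delta W)\phi(h)\,dt, \end{aligned}$$

Formula formula_49: $$\lambda(T) = \frac{\sigma_v}{\sqrt n}\operatorname{diag}(\phi'(h(T)))\,v, \quad \dot\lambda(t) = -\frac{\sigma_w}{\sqrt n}\operatorname{diag}(\phi'(h(t)))\,W^\top\lambda(t).$$

Formula formula_50: $$\delta\mathcal{L}(h,\theta,\lambda,\mu) = \frac{\sigma_v}{\sqrt n}\phi(h(T))^\top\delta v + \frac{\sigma_u}{\sqrt d}\mu^\top(\delta U)x + \int_0^T\frac{\sigma_w}{\sqrt n}\lambda^\top(\delta W)\phi(h)\,dt.$$

Formula formula_51: $$\nabla_v f_\theta(x) = \frac{\sigma_v}{\sqrt n}\phi(h(T)), \quad \nabla_W f_\theta(x) = \int_0^T\frac{\sigma_w}{\sqrt n}\lambda_t\phi(h_t)^\top dt, \quad \nabla_U f_\theta(x) = \frac{\sigma_u}{\sqrt d}\lambda(0)x^\top.$$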
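The gradient formulas above have a direct discrete counterpart. The sketch below is a discretize-then-optimize illustration, not the continuous adjoint solve itself: it backpropagates exactly through $L$ forward-Euler steps, so the integral over $t$ becomes a sum over layers and $\lambda$ is propagated backwards layer by layer. It assumes $\sigma_u=\sigma_w=\sigma_v=1$, step size $T/L$, and $\phi=\tanh$; the function names are hypothetical.

```python
import numpy as np

def forward(x, U, W, v, L, T=1.0):
    """Euler-discretized forward pass; returns the output and all hidden states."""
    n, d = U.shape
    hs = [U @ x / np.sqrt(d)]
    for _ in range(L):
        hs.append(hs[-1] + (T / L) * (W @ np.tanh(hs[-1])) / np.sqrt(n))
    return v @ np.tanh(hs[-1]) / np.sqrt(n), hs

def gradients(x, U, W, v, L, T=1.0):
    """Backward pass: lam holds df/dh at the current layer (the discrete adjoint)."""
    n, d = U.shape
    _, hs = forward(x, U, W, v, L, T)
    dphi = lambda h: 1.0 - np.tanh(h) ** 2
    lam = dphi(hs[-1]) * v / np.sqrt(n)            # terminal condition, cf. lambda(T)
    grad_v = np.tanh(hs[-1]) / np.sqrt(n)
    grad_W = np.zeros_like(W)
    for l in range(L, 0, -1):
        # Riemann-sum version of the integral of lambda_t phi(h_t)^T
        grad_W += (T / L) * np.outer(lam, np.tanh(hs[l - 1])) / np.sqrt(n)
        # one backward Euler-type step of the adjoint recursion
        lam = lam + (T / L) * dphi(hs[l - 1]) * (W.T @ lam) / np.sqrt(n)
    grad_U = np.outer(lam, x) / np.sqrt(d)         # lam is now df/dh_0, cf. lambda(0)
    return grad_v, grad_W, grad_U

rng = np.random.default_rng(0)
n, d, L = 8, 3, 20
U, W, v = rng.standard_normal((n, d)), rng.standard_normal((n, n)), rng.standard_normal(n)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
grad_v, grad_W, grad_U = gradients(x, U, W, v, L)
```

Comparing a few entries of `grad_W` and `grad_U` against central finite differences of `forward` is a quick sanity check that the backward recursion is implemented with the right signs and transposes.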

Formula formula_52: f : x → σw √ n W ϕ(x) is Lipschitz continuous: ∥f (x) -f (z)∥ =∥ σ w √ n W ϕ(x) - σ w √ n W ϕ(z)∥ ≤σ w ∥ϕ(x) -ϕ(z)∥ ≤σ w L 1 ∥x -z∥.

Formula formula_53: h N = ϕ ε (h N -1

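Formula formula_54: $$\lambda_T = \frac{\sigma_v}{\sqrt n}\operatorname{diag}(\phi'(h_T))\,v, \tag{31}$$

Formula formula_55: $$\dot\lambda_t = -\frac{\sigma_w}{\sqrt n}\operatorname{diag}(\phi'(h_t))\,W^\top\lambda_t. \tag{32}$$

Formula formula_56: $g: x\mapsto -\frac{\sigma_w}{\sqrt n}\operatorname{diag}[\phi'(h_t)]W^\top x$ is Lipschitz continuous: $$\|g(x) - g(z)\| = \left\|\frac{\sigma_w}{\sqrt n}\operatorname{diag}[\phi'(h_t)]W^\top(x - z)\right\| \le \sigma_w L_1\|x - z\|,$$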

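Formula formula_57: $$df_\theta = d\big[v^\top\phi(h(T))/\sqrt n\big] = \frac{1}{\sqrt n}v^\top\operatorname{diag}(\phi'(h(T)))\,dh(T).$$

Formula formula_58: $$\frac{\partial f_\theta(x)}{\partial h(T)} = \frac{1}{\sqrt n}\operatorname{diag}(\phi'(h(T)))\,v. \tag{33}$$

Formula formula_59: $$\frac{\partial f_\theta(x)}{\partial h(t)} = \frac{\partial h(t+\varepsilon)}{\partial h(t)}\,\frac{\partial f_\theta(x)}{\partial h(t+\varepsilon)}.$$

Formula formula_60: $$h(t+\varepsilon) = h(t) + \int_t^{t+\varepsilon}\frac{1}{\sqrt n}W\phi(h(s))\,ds. \tag{34}$$

Formula formula_62: $$\begin{aligned} \frac{d}{dt}\frac{\partial f_\theta(x)}{\partial h(t)} &= \lim_{\varepsilon\to0^+}\frac{1}{\varepsilon}\left[\frac{\partial f_\theta}{\partial h(t+\varepsilon)} - \frac{\partial f_\theta}{\partial h(t)}\right] \\ &= \lim_{\varepsilon\to0^+}\frac{1}{\varepsilon}\left[\frac{\partial f_\theta}{\partial h(t+\varepsilon)} - \frac{\partial h(t+\varepsilon)}{\partial h(t)}\frac{\partial f_\theta}{\partial h(t+\varepsilon)}\right] \\ &= \lim_{\varepsilon\to0^+}\frac{1}{\varepsilon}\left[\frac{\partial f_\theta}{\partial h(t+\varepsilon)} - \frac{\partial}{\partial h(t)}\left(h(t) + \frac{1}{\sqrt n}W\phi(h(t))\,\varepsilon + O(\varepsilon^2)\right)\frac{\partial f_\theta}{\partial h(t+\varepsilon)}\right] \\ &= \lim_{\varepsilon\to0^+}\frac{1}{\varepsilon}\left[\frac{\partial f_\theta}{\partial h(t+\varepsilon)} - \left(I + \frac{\varepsilon}{\sqrt n}\operatorname{diag}(\phi'(h(t)))W^\top + O(\varepsilon^2)\right)\frac{\partial f_\theta}{\partial h(t+\varepsilon)}\right] \\ &= -\frac{1}{\sqrt n}\operatorname{diag}(\phi'(h(t)))\,W^\top\frac{\partial f_\theta}{\partial h(t)}. \end{aligned}$$

Formula formula_63: $$\partial_v f_\theta(x) = \frac{\sigma_v}{\sqrt n}\phi(h(T)), \tag{35a}$$ $$\partial_W f_\theta(x) = \int_0^T\frac{\sigma_w}{\sqrt n}\big(\phi(h_t)\otimes\lambda_t\big)\,dt, \tag{35b}$$ $$\partial_U f_\theta(x) = \frac{\sigma_u}{\sqrt d}\,[x\otimes\lambda(0)]. \tag{35c}$$

Formula formula_64: $$\begin{bmatrix}\dot h_t\\ \dot\lambda_t\\ \dot g_t\end{bmatrix} = \frac{\sigma_w}{\sqrt n}\begin{bmatrix} W\phi(h_t)\\ -\operatorname{diag}[\phi'(h_t)]\,W^\top\lambda_t\\ -\phi(h_t)\otimes\lambda_t\end{bmatrix}, \quad \forall t\in[0,T] \tag{36}$$

Formula formula_65: $$\nabla_W f_\theta(x) = g(0) = g(T) + \int_T^0\dot g_t\,dt = g(T) + \int_T^0 -\frac{\sigma_w}{\sqrt n}\phi(h_t)\otimes\lambda_t\,dt = \int_0^T\frac{\sigma_w}{\sqrt n}[\phi(h_t)\otimes\lambda_t]\,dt,$$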

Formula formula_66: g ℓ ∈ R n g 0 (x) := σ v √ d U x,(37)

Formula formula_67: g ℓ (x) := σ w √ n W ϕ(h ℓ-1 ), ∀ℓ ∈ [1, 2, • • • , L].(38)

Formula formula_68: 1 n n α=1 ψ(g 0 α , . . . , g M α ) a.s. → Eψ(z 0 , • • • , z M ),(39)

Formula formula_69: Algorithm 1 ResNet f L θ Forward Computation on Input x Input: U x/ √ d : G(n) Input: W : A(n, n) Input: v : G(n) 1: h 0 := U x/ √ d : G(n) 2: for ℓ ∈ [L] do 3:

Formula formula_70: g ℓ := W x ℓ-1 / √ n : G(n) 5: h ℓ := h ℓ-1 + κ • g ℓ : G(n) 6: end for 7: x L = ϕ(h L ) : H(n) Output: v T x L / √ n

Formula formula_71: BASIC CASE L = 0 As L = 0, we have f 0 θ (x) = v T ϕ(h 0 )/ √ n.

Formula formula_72: g 0 k i.i.d. ∼ :=Σ 0 (x,x)

Formula formula_73: f 0 θ |B 0 ∼ N (0, ∥ϕ 0 ∥ 2 /n),

Formula formula_74: σ 2 v ∥ϕ 0 ∥ 2 /n = σ 2 v n n k=1 ϕ(h 0 k ) 2 = σ 2 v n n k=1 ϕ(g 0 k ) 2 a.s.

Formula formula_75: f 0 θ → GP(0, Σ 1 ), where Σ 1 (x, x) = E z 0 ∼Σ 0 ϕ(z 0 (x))ϕ(z 0 (x)).

Formula formula_76: GENERAL CASE L Now consider f L θ (x) = v T ϕ(h L )/ √ n.

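Formula formula_77: $g^\ell = W\phi(h^{\ell-1})$, $\forall\ell\in\{1,2,\dots,L-1\}$, or equivalently $$\underbrace{[\,g^1\ \cdots\ g^{L-1}\,]}_{=:G} = W\underbrace{[\,\phi^0\ \cdots\ \phi^{L-2}\,]}_{=:\Phi}$$

Formula formula_78: $$\min_W \frac12\|W\|_F^2, \quad\text{s.t. } G = W\Phi.$$

Formula formula_79: $$\mathcal{L}(W,V) = \frac12\|W\|_F^2 + \langle V, G - W\Phi\rangle.$$ Then $\nabla_W\mathcal{L}(W,V) = W - V\Phi^\top = 0 \implies W^* = V\Phi^\top$.

Formula formula_80: $$G = W\Phi = V\Phi^\top\Phi \implies V = G(\Phi^\top\Phi)^\dagger \implies W^* = G(\Phi^\top\Phi)^\dagger\Phi^\top.$$

Formula formula_81: $$W\,|\,\mathcal{B} = W^* + \tilde W\Pi^\top = G(\Phi^\top\Phi)^\dagger\Phi^\top + \tilde W\big(I_n - \Phi\Phi^\dagger\big),$$

Formula formula_82: $\Pi = I_n - \Phi\Phi^\dagger$, $\tilde W$ is an i.i.d. copy of $W$, and $\Phi^\dagger = (\Phi^\top\Phi)^\dagger\Phi^\top$.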
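The minimum-norm solution and the projector $\Pi$ can be verified numerically. The sketch below uses small random matrices as stand-ins for $W$ and $\Phi$ (the sizes and variable names are illustrative) and checks the deterministic identity $W = W^* + W\Pi^\top$; the conditioning statement additionally replaces the second $W$ by an independent copy, which is a probabilistic fact not tested here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3                                  # width and number of conditioning columns (m < n)
W = rng.standard_normal((n, n))
Phi = rng.standard_normal((n, m))            # stand-in for [phi^0 ... phi^{L-2}]
G = W @ Phi                                  # observed linear measurements of W

# Minimum Frobenius-norm solution of G = W Phi, via the pseudo-inverse.
W_star = G @ np.linalg.pinv(Phi.T @ Phi) @ Phi.T

# Orthogonal projector onto the complement of the column space of Phi.
Pi = np.eye(n) - Phi @ np.linalg.pinv(Phi)

constraint_ok = np.allclose(W_star @ Phi, G)          # W* satisfies the constraint
decomposition_ok = np.allclose(W, W_star + W @ Pi.T)  # W splits into W* plus a part invisible to Phi
```

The component `W @ Pi.T` annihilates `Phi`, so it is unconstrained by the observations; this is why it can be resampled independently when conditioning on $G$.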

Formula formula_83: g L k |B independent ∼ N (G k * (Φ T Φ) † Φ T ϕ, ∥Π T ϕ∥ 2 /n).

Formula formula_84: ϕ i , ϕ j /n = 1 n n k=1 ϕ(h i k )ϕ(h j k ) = 1 n n k=1 ϕ(g 0 k + βg 1 k + • • • + βg i k )ϕ(g 0 k + βg 1 k + • • • + βg j k ) a.s. → Eϕ(z 0 + βz 1 + • • • + βz i )ϕ(z 0 + βz 1 + • • • + βz j ) =:Eϕ(u i )ϕ(u j ),

Formula formula_85: u i = z 0 + βz 1 + • • • + βz i .

Formula formula_86: (Φ T Φ) ij /n = ϕ i , ϕ j /n a.s. → Eϕ(u i )ϕ(u j ), (Φ T ϕ) i /n = ϕ i , ϕ /n a.s. → Eϕ(u i )ϕ(u L-1 ). For ℓ ∈ {0, 1, • • • , L -1}, let U ℓ = {u 0 , • • • , u ℓ } be a collection of u i . We define Σ(U ℓ , U k ) ∈ R (ℓ+1)×(k+1) as Σ(U ℓ , U k ) ij = Σ(u i , u j ) = Eϕ(u i )ϕ(u j ), ∀i ∈ {0, 1, • • • , ℓ}, j ∈ {0, 1, • • • , k}.

Formula formula_87: (Φ T Φ) † Φ T ϕ = (Φ T Φ/n) † Φ T ϕ/n → Σ(U L-2 , U L-2 ) † Σ(U L-2 , u L-1 ).

Formula formula_88: ∥Π T ϕ∥ 2 /n = 1 n ϕ T (I n -ΦΦ † )ϕ = 1 n ϕ T ϕ - 1 n ϕ T Φ(Φ T Φ) † Φ T ϕ =ϕ T ϕ/n -(ϕ T Φ/n)(Φ T Φ/n) † (Φ T ϕ/n) →Σ(u L-1 , u L-1 ) -Σ(u L-1 , U L-2 )Σ(U L-2 , U L-2 ) † Σ(U L-2 , u L-1 )

Formula formula_89: 1 n n k=1 ψ(g 0 k , g 1 k , • • • , g L k ) → E ψ(z 0 , z 1 , • • • , z L ) ,

Formula formula_90: Cov(z 0 (x), z ℓ (x)) = 0, ∀ℓ ≥ 1 Cov(z ℓ (x), z k (x)) = E ϕ u ℓ-1 (x) ϕ u k-1 (x) , ∀ℓ, k ≥ 1 Let B L be the smallest σ-algebra generated by {g 0 , • • • , g L }.

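Formula formula_91: $$f^L_\theta(x)\,|\,\mathcal{B}_L \sim \mathcal{N}\big(0, \|\phi^L\|^2/n\big) \tag{40}$$ where $$\|\phi^L\|^2/n = \frac1n\sum_{k=1}^n\phi(h^L_k)^2 = \frac1n\sum_{k=1}^n\phi\Big(g^0_k + \beta\sum_{i=1}^L g^i_k\Big)^2 \xrightarrow{\text{a.s.}} \mathbb{E}\,\phi\Big(z^0 + \beta\sum_{i=1}^L z^i\Big)^2 = \mathbb{E}\big[\phi(u^L)^2\big] =: \Sigma^{L+1}(x,x)$$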

Formula formula_92: f L θ → GP(0, Σ L+1 ) where Σ L+1 (x, x) = E ϕ u L (x) ϕ u L (x) .

Formula formula_93: f θ (x)|B ∼ N

Formula formula_94: ϕ L (x) → ϕ T (x), as L → ∞,

Formula formula_95: a n,L := ϕ L (x), ϕ L (x) /n.

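Formula formula_96: $$\|h_t\| \le C\sqrt n\,e^{C\sigma L_1 t}, \quad \forall t\in[0,T] \tag{41}$$ and $$\|h^\ell - h(t_\ell)\| \le \frac{A}{2B}\left(e^{Bt_\ell} - 1\right)\frac{T}{L}\sqrt n, \tag{42}$$

Formula formula_97: $x\mapsto\frac{\sigma}{\sqrt n}W\phi(x)$ is $\sigma L_1$-Lipschitz continuous. Observe that $$d(\dot h) = d\big[\sigma W\phi(h(t))/\sqrt n\big] = \frac{\sigma}{\sqrt n}W\operatorname{diag}[\phi'(h(t))]\,dh(t) = \frac{\sigma}{\sqrt n}W\operatorname{diag}[\phi'(h(t))]\,\dot h(t)\,dt = \frac{\sigma}{\sqrt n}W\operatorname{diag}[\phi'(h(t))]\,\frac{\sigma}{\sqrt n}W\phi(h(t))\,dt.$$

Formula formula_98: $$\ddot h = \frac{d}{dt}\dot h = \frac{\sigma}{\sqrt n}W\operatorname{diag}[\phi'(h(t))]\,\frac{\sigma}{\sqrt n}W\phi(h(t))$$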

Formula formula_99: ∥ ḧ∥ ≤ C 2 σ 2 L 2 1 ∥h(t)∥

Formula formula_100: h(t) = h(0) + t 0 ḣds implies ∥h(t)∥ ≤∥h(0)∥ + t 0 ∥ σ √ n W ϕ(h(s))∥ds ≤∥h(0)∥ + t 0

Formula formula_101: ∥h(t)∥ ≤ ∥h(0)∥ exp t 0 CσL 1 ds = ∥h(0)∥e CσL1t

Formula formula_102: ∥h(t)∥ ≤ C √ ne CσL1t , ∀t ∈ [0, T ].

Formula formula_103: ∥ ḧ(t)∥ ≤ Cσ 2 L 2 1 √ ne CσL1t , ∀t ∈ [0, T ].

Formula formula_104: ∥h ℓ -h(t ℓ )∥ ≤ A 2B e Bt ℓ -1 T L √ n,

Formula formula_105: $$\left|\frac1n\big\langle\phi(h^k(x)),\phi(h^\ell(\bar x))\big\rangle - \frac1n\big\langle\phi(h_{t_k}(x)),\phi(h_{t_\ell}(\bar x))\big\rangle\right| \le C_1 L^{-1}, \quad \forall k,\ell\in[L] \tag{43}$$

Formula formula_107: ϕ k , φℓ /n -ϕ(kβ)), φ(ℓβ) /n = 1 n ϕ k , φℓ -φ(ℓβ) + 1 n ϕ k -ϕ(kβ)), φ(ℓβ) ,

Formula formula_108: ∥h ℓ+1 ∥ = ∥h ℓ + T L σ √ n W ϕ(h ℓ )∥ ≤ ∥h ℓ ∥ + Cσ T L ∥h ℓ ∥ = (1 + CσT /L)∥h ℓ ∥.

Formula formula_109: ∥h ℓ+1 ∥ ≤ (1 + CσT /L) ℓ+1 ∥h 0 ∥ Therefore, we obtain ∥ϕ ℓ ∥ ≤ ∥h ℓ (x)∥ ≤ (1 + CσT /L) ℓ ∥h 0 ∥ ≤ e CσT ℓ/L ∥h 0 ∥ ≤ C √ ne CσT ℓ/L ,

Formula formula_110: ∥ϕ ℓ -ϕ(ℓβ)∥ ≤ ∥h ℓ -h(ℓβ)∥ ≤ C 1 √ nL -1 ,

Formula formula_111: ϕ k , φℓ /n -ϕ(kβ)), φ(ℓβ) /n ≤ 1 n • C 1 √ n • C 1 √ nL -1 = C 1 L -1 .

Formula formula_112: lim n→∞ ⟨ϕ(h T (x)), ϕ(h T (x))⟩ /n = lim n→∞ lim L→∞ ϕ(h L (x)), ϕ(h L (x)) /n = lim L→∞ lim n→∞ ϕ(h L (x)), ϕ(h L (x)) /n = lim L→∞ Σ L+1 (x, x) =Σ * (x, x).

Formula formula_113: λℓ+1 = λℓ -β • σ w √ n diag[ϕ ′ (h t ℓ )]W T λℓ , ∀ℓ ∈ [1, 2, • • • , L](44)

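Formula formula_114: $$\lambda^{\ell+1} = \lambda^\ell - \beta\cdot\frac{\sigma_w}{\sqrt n}\operatorname{diag}[\phi'(h^\ell)]\,W^\top\lambda^\ell, \quad \forall\ell\in[1,2,\dots,L]. \tag{45}$$ As $L\to\infty$, or equivalently $\beta\to0$,

Formula formula_115: $$\|\lambda_t\| \le C\sigma L_1 e^{C\sigma L_1(T-t)}, \quad \forall t\in[0,T] \tag{46}$$

Formula formula_116: $$\|\lambda^\ell - \lambda_t\| \le \frac{T}{L}\,\frac{C_1}{C_2}\left(e^{C_2(T-t_\ell)} - 1\right), \tag{47}$$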

Formula formula_117: C 1 = CL 2 1 L 2 σ 3 e CσL1T , C 2 = CσL 1 +Cσ 2 L 1 L 2 e

Formula formula_118: (λ, t) → -1 √ n diag[ϕ ′ (h t )]W T λ, we consider d λ =d - σ √ n diag [ϕ ′ (h(t))] W T λ =d -ϕ ′ (h(t)) ⊙ W T λ = -[dϕ ′ (h t )] ⊙ W T λ = -ϕ ′′ (h t ) ⊙ dh t ⊙ W T λ = -ϕ ′′ (h t ) ⊙ W T λ ⊙ ḣdt = -ϕ ′′ (h t ) ⊙ W T λ ⊙ W ϕ(h t )dt = -diag (ϕ ′′ (h t )) diag W T λ W ϕ(h t )dt,

Formula formula_119: ∂ t f (λ, t) = -diag (ϕ ′′ (h t )) diag W T λ W ϕ(h t ).

Formula formula_120: ∥∂ t f (λ, t)∥ ≤ |ϕ ′′ | • σ √ n ∥λ∥ • ∥ W ∥ • ∥ϕ(h t )∥ ≤ CL 1 L 2 σ 2 ∥λ∥ • ∥h t ∥/ √ n,

Formula formula_121: ∥λ t ∥ ≤ ∥λ T ∥ + T t ∥ λ∥ts ≤ CσL 1 + T t CσL 1 ∥λ s ∥ds.

Formula formula_122: ∥λ t ∥ ≤ CσL 1 exp T t CσL 1 ds ≤ CσL 1 e CσL1(T -t) .

Formula formula_123: ∥∂ t f (λ, t)∥ ≤ CL 2 1 L 2 σ 3 e CσL1T := C 1 .

Formula formula_124: ∥λ ℓ+1 -λ(t ℓ+1 )∥ = λ ℓ -βdiag[ϕ ′ (h ℓ )] W T λ ℓ -λ(t ℓ ) + β λ(t ℓ ) + β 2 2 λ(t ℓ ) ≤∥λ ℓ -λ(t ℓ )∥ + β∥diag[ϕ ′ (h ℓ )] W T λ ℓ -diag[ϕ ′ (h(t ℓ ))] W T λ(t ℓ )∥ + β 2 2 C 1 ,

Formula formula_125: ∥diag[ϕ ′ (h ℓ )] W T λ ℓ -diag[ϕ ′ (h(t ℓ ))] W T λ(t ℓ )∥ ≤∥diag[ϕ ′ (h ℓ )] W T (λ ℓ -λ t ℓ )∥ + ∥(diag[ϕ ′ (h ℓ )] -diag[ϕ ′ (h(t ℓ ))]) W T λ(t ℓ )∥ ≤L 1 ∥ W ∥∥λ ℓ -λ t ℓ ∥ + L 2 ∥h ℓ -h t ℓ ∥∥ W ∥∥λ t ℓ ∥ ≤C 2 ∥λ ℓ -λ t ℓ ∥ + ∥h ℓ -h t ℓ ∥ ,

Formula formula_126: C 2 = CσL 1 + Cσ 2 L 1 L 2 e CσL1T

Formula formula_127: ∥λ ℓ+1 -λ t ℓ+1 ∥ ≤ ∥λ ℓ -λ(t ℓ )∥ + βC 2 ∥λ ℓ -λ t ℓ ∥ + ∥h ℓ -h t ℓ ∥ + β 2 C 1 . Denote E ℓ = ∥λ ℓ -λ t ℓ ∥ + ∥h ℓ -h t ℓ ∥, then we have ∥λ ℓ -λ t ℓ ∥ ≤ E ℓ ≤ (1 + βC 2 )E ℓ-1 + β 2 C 1 .

Formula formula_128: E ℓ ≤ (1 + βC 2 ) ℓ E 0 + β 2 C 1 • (1 + βC 2 ) ℓ -1 (1 + βC 2 ) -1 .

Formula formula_129: E ℓ ≤ T L C 1 C 2 e C2(T -t ℓ ) -1 .

Formula formula_130: ∂f θ ∂h t ℓ - ∂f L θ ∂h ℓ ≤ C 0 L -1 , ∀ℓ ∈ [1, 2, • • • , L],(48)

Formula formula_131: ∥∇ v f L -∇ v f θ ∥ = σ √ n ∥ϕ(h L ) -ϕ(h(T )∥ ≤ σ √ n • CL -1 • √ n ≤ CL -1 , ∥∇ W f L -∇ W f θ ∥ = T 0 1 √ n ∂f ∂h t ϕ(h t )dt - L ℓ=1 T L 1 √ n ∂f L ∂h ℓ ϕ(h ℓ-1 ) ≤ 1 √ n L ℓ=1 t ℓ t ℓ-1 ∂f ∂h t ϕ(h t ) - ∂f L ∂h ℓ ϕ(h ℓ-1 ) dt ≤ 1 √ n L ℓ=1 t ℓ t ℓ-1 ∂f ∂h t - ∂f L ∂h ℓ ∥h t ∥ + ∥ ∂f L ∂h ℓ ∥∥h t -h ℓ-1 ∥dt ≤ C √ n L ℓ=1 t ℓ t ℓ-1 √ nL -1 dt ≤C L ℓ=1 L -2 = CL -1 , ∥∇ U f L -∇ U f θ ∥ ≤ σ √ d ∥x∥∥λ 0 -λ 0 ∥ ≤ CL -1 .

Formula formula_132: $$K_\theta(x,\bar x) = \langle\nabla_\theta f_\theta(x), \nabla_\theta f_\theta(\bar x)\rangle. \tag{49}$$

Formula formula_134: $$K^L_\theta(x,\bar x) := \big\langle\nabla_\theta f^L_\theta(x), \nabla_\theta f^L_\theta(\bar x)\big\rangle. \tag{50}$$

Formula formula_136: 1 n n α=1 ψ(g 0 α , . . . , g M α ) a.s. → Eψ(z 0 , • • • , z M ),(51)

Formula formula_137: Algorithm 2 ResNet f L θ Forward and Backward Computation on Input x Input: U x/ √ d : G(n) Input: W : A(n, n) Input: v : G(n) 1: h 0 := U x/ √ d : G(n) 2: for ℓ ∈ {1, 2, • • • , L} do 3:

Formula formula_138: g ℓ := W x ℓ / √ n : G(n) 5: h ℓ := h ℓ-1 + κ • g ℓ : G(n) 6: end for 7: x L = ϕ(h L ) : H(n) 8: dx L = v/ √ n : G(n) 9: dh L = dx L ⊙ ϕ ′ (h L ) : H(n) 10: for ℓ ∈ {L, L -1, • • • , 1} do 11: dg ℓ = κ • dh ℓ : H(n) 12: dx ℓ = W ⊤ dg ℓ / √ n : G(n) 13: dh ℓ-1 = dh ℓ + ϕ ′ (h ℓ -1) ⊙ dx ℓ : H(n) 14: end for Output: ∥x L ∥ 2 /n + L ℓ=1 dg ℓ x ℓ⊤ , dg ℓ x ℓ⊤ /n + dh 0 x ⊤ , dh 0 x ⊤ /d

Formula formula_139: dg L+1 := σ v √ n diag[ϕ ′ (h L )]v, dg ℓ := σ w √ n diag[ϕ ′ (h ℓ-1 )]W T , ∀[1, 2, • • • , L].

Formula formula_140: K L θ (x, x) = ∇ v f L θ (x), ∇ v f L θ (x) + ∇ W f L θ (x), ∇ W f L θ (x) + ∇ U f L θ (x), ∇ U f L θ (x) .

Formula formula_141: CONVERGENCE OF ∇ v f, ∇ v f

Formula formula_142: $$\nabla_v f = \phi(h^L)/\sqrt n, \tag{52}$$ $$\nabla_{h^L} f = v\odot\phi'(h^L)/\sqrt n. \tag{53}$$

Formula formula_144: ∇ v f, ∇ v f = 1 n ϕ(h L ) ⊤ ϕ( hL ) a.s. → Eϕ(u L )ϕ(ū L ) = C L+1,L+1 (x, x),

Formula formula_145: CONVERGENCE OF ∇ W f, ∇ W f

Formula formula_146: g ℓ = 1 √ n W x ℓ-1 h ℓ = h ℓ-1 + κg ℓ , x ℓ = ϕ(h ℓ ).

Formula formula_147: ∇ W f = 1 √ n L ℓ=1 (∇ g ℓ f ) • (x ℓ-1 ) ⊤ (54)

Formula formula_148: ∇ W f, ∇ W f = L ℓ,k=1 dg ℓ , dḡ k • x ℓ-1 , xk-1 /n

Formula formula_149: 1 n x ℓ-1 , xk-1 = 1 n ϕ(h ℓ-1 ), ϕ( hk-1 ) a.s. → C ℓ,k (x, x).(55)

Formula formula_150: dx ℓ-1 = 1 √ n W ⊤ dg ℓ and dg ℓ = κdh ℓ = κ(dh ℓ+1 + dx ℓ ⊙ ϕ ′ (h ℓ )) = dg ℓ+1 + κ dx ℓ ⊙ ϕ ′ (h ℓ )

Formula formula_151: dg ℓ =κ L i=ℓ dx i ⊙ ϕ ′ (h i )(56)

Formula formula_152: E[Z dx ℓ-1 Z dx k-1 ] =κ 2 E   ℓ,k i,j=L Z dx i Zdx j ϕ ′ (u i )ϕ ′ (ū j )   =κ 2 ℓ,k i,j=L E Z dx i Zdx j E[ϕ ′ (u i )ϕ ′ (ū j )]

Formula formula_153: D ℓ,k (x, x) = κ 2 ℓ+1,k+1 i,j=L D i,j (x, x)E[ϕ ′ (u i )ϕ ′ (ū j )](57)

Formula formula_154: ∇ W f, ∇ W f a.s. -→ κ 2 L ℓ,k=1 C ℓ,k (x, x)D ℓ,k (x, x) (58) CONVERGENCE OF ∇ U f, ∇ U f As h 0 = U x, h 0 i = d j=1 U ij x j implies ∂h 0 k /∂U ij = δ k,i x j . Observe that ∇ U f, ∇ U f = i,j ∂f ∂U ij ∂ f U ij = ij α ∂h 0 α ∂U ij ∂f ∂h 0 α   β ∂ h0 β ∂U ij ∂ f ∂ h0 β   = α,β ∂f ∂h 0 α ∂ f ∂ h0 β i,j ∂h 0 α ∂U ij ∂ h0 β ∂U ij = α,β ∂f ∂h 0 α ∂ f ∂ h0 β i,j δ α,i x j δ β,i xj = α,β ∂f ∂h 0 α ∂ f ∂ h0 β • δ α,β x T x = α ∂f ∂h 0 α ∂ f ∂ h0 α • x T x a.s. → D 0,0 (x, x)C 0,0 (x, x),

Formula formula_155: ∇ θ f, ∇ θ f = ∇ v f, ∇ v f + ∇ W f, ∇ W f + ∇ U f, ∇ U f a.s. -→ C L+1,L+1 (x, x) + L ℓ,k=1

Formula formula_156: K L ∞ (x, x) = C L+1,L+1 (x, x) + L ℓ,k=1

Formula formula_157: K ∞ (x, x) = lim n→∞ ∇ θ f θ , ∇ θ fθ = lim n→∞ lim L→∞ ∇ θ f L θ , ∇ θ f L θ We have shown lim n→∞ ∇ θ f L θ , ∇ θ f L θ = K L ∞ (x, x

Formula formula_158: K θ (x, x) = ⟨∇ v f θ (x), ∇ v f θ (x)⟩ + ⟨∇ W f θ (x), ∇ W f θ (x)⟩ + ⟨∇ U f θ (x), ∇ U f θ (x)⟩ .

Formula formula_159: ∇ v f L (x), ∇ v f L (x) -⟨∇ v f θ (x), ∇ v f θ (x)⟩ = 1 n ϕ(h L (x)), ϕ(h L (x)) - 1 n ⟨ϕ(h(x, T )), ϕ(h(x, T ))⟩ = 1 n ϕ(h L (x)), ϕ(h L (x)) -ϕ(h(x, T )) + 1 n ϕ(h L (x)) -ϕ(h(x, T )), ϕ(h(x, T )) ≤ L 2 1 n ∥h L (x)∥∥h L (x) -h(x, T )∥ + L 2 1 n ∥h L (x) -h(x, T )∥∥h(x, T )∥ ≤ 1 n C √ n • √ nL -1 =CL -1 ,

Formula formula_160: ∥∇ W f (x)∥ =∥ T 0 1 √ n λ t ϕ(h t )dt∥ ≤∥ T 0 1 √ n • e Cσ(T -t) • √ ne Cσt dt∥ ≤CσT e CσT ,

Formula formula_161: ∥∇ W f L (x)∥ =∥ L ℓ=1 T L 1 √ n ∂f L ∂h ℓ ϕ(h ℓ-1 )∥ ≤ T L L ℓ=1 1 √ n ∥ ∂f L ∂h ℓ ∥∥h ℓ-1 ∥ ≤ T L L ℓ=1 1 √ n • (1 + σT /L) L-ℓ • (1 + σT /L) ℓ-1 • Cσ √ n ≤CσT e σT ,

Formula formula_162: $$\|h^\ell\| \le (1 + \sigma T/L)^\ell\,\|h^0\|, \tag{59}$$

Formula formula_163: $$\left\|\frac{\partial f^L}{\partial h^\ell}\right\| \le (1 + \sigma T/L)^{L-\ell}\,\big\|\partial f^L/\partial h^L\big\|, \tag{60}$$ for all $\ell\in\{0,1,\dots,L\}$.

Formula formula_165: ∥∇ W f L (x) -∇ W f θ (x)∥ = T 0 1 √ n ∂f ∂h t ϕ(h t )dt - L ℓ=1 T L 1 √ n ∂f L ∂h ℓ ϕ(h ℓ-1 ) ≤ 1 √ n L ℓ=1 t ℓ t ℓ-1 ∂f ∂h t ϕ(h t ) - ∂f L ∂h ℓ ϕ(h ℓ-1 ) dt ≤ 1 √ n L ℓ=1 t ℓ t ℓ-1 ∂f ∂h t - ∂f L ∂h ℓ ∥h t ∥ + ∥ ∂f L ∂h ℓ ∥∥h t -h ℓ-1 ∥dt ≤ C √ n L ℓ=1 t ℓ t ℓ-1 √ nL -1 dt ≤C L ℓ=1 L -2 = CL -1 .

Formula formula_166: ∇ W f L (x), ∇ W f L (x) -⟨∇ W f θ (x), ∇ W f θ (x)⟩ ≤ ∇ W f L (x), ∇ W f L (x) -∇ W f θ (x) + ∇ W f L (x) -∇ W f θ (x), ∇ W f θ (x) ≤∥∇ W f L (x)∥ • ∥∇ W f L (x) -∇ W f θ (x)∥ + ∥∇ W f L (x) -∇ W f θ (x)∥∥∇ W f θ (x)∥ ≤CL -1 ,

Formula formula_167: $$\Big|\big\langle\nabla_W f^L(x), \nabla_W f^L(\bar x)\big\rangle - \big\langle\nabla_W f_\theta(x), \nabla_W f_\theta(\bar x)\big\rangle\Big| \le C L^{-1}. \tag{61}$$

Formula formula_169: ∇ U f L (x), ∇ U f L (x) -⟨∇ U f θ (x), ∇ U f θ (x)⟩ = ⟨x, x⟩ ∂f L (x) ∂h 0 (x) , ∂f L (x) ∂h 0 (x) -⟨x, x⟩ ∂f θ (x) ∂h(x, 0) , ∂f θ (x) ∂h(x, 0) .

Formula formula_170: ∂f L (x) ∂h 0 (x) , ∂f L (x) ∂h 0 (x) - ∂f θ (x) ∂h(x, 0) , ∂f θ (x) ∂h(x, 0) ≤ ∂f L (x) ∂h 0 (x) , ∂f L (x) ∂h 0 (x) - ∂f θ (x) ∂h(x, 0) + ∂f L (x) ∂h 0 (x) - ∂f θ (x) ∂h(x, 0) , ∂f θ (x) ∂h(0, x) ≤∥ ∂f L (x) ∂h 0 (x) ∥ • ∥ ∂f L (x) ∂h 0 (x) - ∂f θ (x) ∂h(x, 0) ∥ + ∥ ∂f L (x) ∂h 0 (x) - ∂f θ (x) ∂h(x, 0) ∥ • ∥ ∂f θ (x) ∂h(0, x) ∥ ≤CL -1 ,

Formula formula_171: ∇ U f L (x), ∇ U f L (x) -⟨∇ U f θ (x), ∇ U f θ (x)⟩ ≤ CL -1 .

Formula formula_172: $$\Big|\big\langle\nabla_\theta f^L(x), \nabla_\theta f^L(\bar x)\big\rangle - \big\langle\nabla_\theta f_\theta(x), \nabla_\theta f_\theta(\bar x)\big\rangle\Big| \le C L^{-1}. \tag{62}$$

Formula formula_174: K ∞ (x, x) = lim n→∞ K θ (x, x) = lim n→∞ ∇ θ f θ , ∇ θ fθ = lim n→∞ lim L→∞ ∇ θ f L θ , ∇ θ f L θ = lim L→∞ lim n→∞ ∇ θ f L θ , ∇ θ f L θ = lim L→∞ K L ∞ (x, x).

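Formula formula_175: $$\Sigma_{0,t}(x,\bar x) = \delta_{0,t}\frac{\sigma_u^2}{d}\,x^\top\bar x, \quad \forall t\in[0,T], \tag{63}$$ $$\Sigma_{t,s}(x,\bar x) = \sigma_w^2\,\mathbb{E}\,\phi(u_t)\phi(\bar u_s), \quad \forall t,s\in[0,T], \tag{64}$$

Formula formula_176: $$\mathbb{E}(u_t, \bar u_s) = \Sigma_{0,0} + \int_0^t\!\!\int_0^s\Sigma_{t',s'}(x,\bar x)\,dt'\,ds'. \tag{65}$$

Formula formula_177: $$\Sigma^*(x,\bar x) = \Sigma_{T,T}(x,\bar x) = \sigma_v^2\,\mathbb{E}\,\phi(u_T)\phi(\bar u_T) \tag{66}$$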

Formula formula_178: K t,s (x, x) = T t T s K t ′ ,s ′ (x, x) Σt ′ ,s ′ (x, x)dt ′ ds ′ , (67

Formula formula_179: )

Formula formula_180: $$K_\infty(x,\bar x) = \Sigma^*(x,\bar x) + \int_0^T\!\!\int_0^T\Sigma_{t,s}(x,\bar x)\,K_{t,s}(x,\bar x)\,dt\,ds + \Sigma_{0,0}(x,\bar x)\,K_{0,0}(x,\bar x). \tag{68}$$

Formula formula_181: Definition 2. A kernel function k : X × X → R is strictly positive definite (SPD) if, for any finite set of distinct points x 1 , • • • , x N ∈ X, the symmetric matrix K = [k(x i , x j )] N i,j=1

Formula formula_182: K θ (x, x) = ⟨∇ v f θ (x), ∇ v f θ (x)⟩ + ⟨∇ W f θ (x), ∇ W f θ (x)⟩ + ⟨∇ U f θ (x), ∇ U f θ (x)⟩ .

Formula formula_183: ⟨∇ v f θ (x), ∇ v f θ (x)⟩ → Σ * (x, x). Hence, to show K ∞ is SPD, it is sufficient to show Σ * is SPD.

Formula formula_184: L < ∞. Proposition 6. Suppose ϕ is L 1 -Lipschitz continuous. If ϕ is non-polynomial nonlinear, then Σ L is SPD on S d-1 for 1 ≤ L < ∞.

Formula formula_185: $$\langle f, g\rangle := \mathbb{E}_{x\sim\mathcal{N}(0,1)}\,f(x)g(x).$$

Formula formula_186: $$\|f\|^2 = \langle f, f\rangle = \mathbb{E}_{x\sim\mathcal{N}(0,1)}\,|f(x)|^2 < \infty.$$

Formula formula_187: $$h_n(x) = (-1)^n e^{x^2/2}\frac{d^n}{dx^n}e^{-x^2/2},$$

Formula formula_188: $$\hat\phi(\rho) := \mathbb{E}_{(u,v)\sim\mathcal{N}_\rho}\,\phi(u)\phi(v).$$

Formula formula_189: $K_\phi: S^{d-1}\times S^{d-1}\to\mathbb{R}$ is defined by $K_\phi(x,\bar x) := \hat\phi(x^\top\bar x)$.

Formula formula_190: $$\phi(x) = \sum_{n=0}^\infty a_n h_n(x), \tag{69}$$

Formula formula_191: $$\hat\phi(\rho) = \sum_{n=0}^\infty a_n^2\rho^n. \tag{70}$$
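The power-series form (70) of the dual activation can be checked numerically. The sketch below assumes the Hermite polynomials are normalized to unit norm under $\mathcal{N}(0,1)$ (that is, the probabilists' polynomials divided by $\sqrt{n!}$), which is what makes (70) hold with coefficients $a_n^2$. It computes the coefficients $a_n$ by Gauss-Hermite quadrature and uses $\phi=\sin$ as a test activation because its dual has the closed form $e^{-1}\sinh(\rho)$. The function names are illustrative.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite_coeffs(phi, n_terms=20, n_nodes=80):
    """Coefficients a_n = E[phi(x) h_n(x)], x ~ N(0,1), with orthonormal Hermite h_n."""
    x, w = He.hermegauss(n_nodes)          # nodes/weights for the weight exp(-x^2 / 2)
    w = w / math.sqrt(2.0 * math.pi)       # rescale so the weights integrate against N(0,1)
    coeffs = []
    for n in range(n_terms):
        basis = np.zeros(n + 1)
        basis[n] = 1.0                     # selects the n-th probabilists' Hermite polynomial
        h_n = He.hermeval(x, basis) / math.sqrt(math.factorial(n))
        coeffs.append(np.sum(w * phi(x) * h_n))
    return np.array(coeffs)

def dual_activation(coeffs, rho):
    """Truncated series sum_n a_n^2 rho^n from (70)."""
    return sum(a ** 2 * rho ** n for n, a in enumerate(coeffs))

a = hermite_coeffs(np.sin)
rho = 0.5
series_value = dual_activation(a, rho)
closed_form = math.exp(-1.0) * math.sinh(rho)   # E[sin(u) sin(v)] for unit-variance u, v with correlation rho
```

For a non-polynomial activation such as $\sin$, infinitely many coefficients $a_n$ are nonzero, which is the property the SPD argument relies on; a polynomial activation would give a finite series instead.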

Formula formula_193: [-1, 1] → R with f (ρ) = ∞ n=0 b n ρ n , the kernel K f : S d-1 × S d-1 → R defined by K f (x, x) := f (x T x)

Formula formula_194: σ 2 u d ⟨x, x⟩ and we have Σ 1 (x, x) = σ 2 w E (u,v)∼N (0,G 0 ) [ϕ(u)ϕ(v)]

Formula formula_195: G 0 = σ 2 u d 1 ⟨x, x⟩ ⟨x, x⟩ 1 .

Formula formula_196: Σ 1 (x, x) = σ 2 w μ(x T x),

Formula formula_197: := ϕ(σ u x/ √ d).

Formula formula_198: Σ 1 (x, x) = σ 2 w μ(x T x) = σ 2 w ∞ n=0 a 2 n (x T x) n .

Formula formula_199: 1. E[u ℓ ūℓ ] = C 0,0 (x, x) + κ 2 ℓ i,j=1 C i,j (x, x) is SPD for all ℓ ∈ {1, 2, • • • , L + 1},

Formula formula_200: C 1,ℓ (x, x) = Eϕ(u 0 )ϕ(ū 1 ) = Eϕ(u 0 )ϕ(ū 0 ) = C 1,1 ,

Formula formula_201: E[z 0 zℓ ] = δ 0,ℓ C 0,0 (x, x). Thus, C 1,ℓ is SPD for all ℓ. Recall that E[u ℓ ūk ] = C 0,0 (x, x) + ℓ i=1 k j=1 C i,j (x, x). Using this relation, we can write E[u ℓ ūℓ ] = C 0,0 (x, x) + C 1,1 (x, x) + 2 ℓ i=2 C 1,i (x, x) + ℓ i,j=2 C i,j (x, x).

Formula formula_202: {x 1 , • • • , x N } and nonzero a ∈ R N such that 0 = N i,j=1 a i a j C ℓ+1,ℓ+1 (x i , x j ) = i,j a i a j E[ϕ(u ℓ i )ϕ(u ℓ j )] = E N i=1 a i ϕ(u ℓ i ) 2 .

Formula formula_203: u ℓ := (u ℓ 1 , • • • , u ℓ N ) ∈ R N is

Formula formula_204: Σ * (x, x) = E[ϕ(u * )ϕ(ū * )]

Formula formula_205: S L (x, x) = C 0,0 (x, x) + κ 2 L ℓ,k=1 C ℓ,k (x, x) → S * (x, x), as L → ∞.(71)

Formula formula_206: 1. S L (x, x) = S L (x, x)

Formula formula_207: S L+1 (x, x) = S L (x, x) + 2 L ℓ=1 C ℓ,L+1 (x, x) + C L+1,L+1 (x, x).

Formula formula_208: • • • , L + 1} we have C ℓ,L+1 (x, x) = Eϕ(u ℓ-1 )ϕ(u L ) = Eϕ(ū ℓ-1 )ϕ(ū L ) = C ℓ,L+1 (x, x),

Formula formula_209: E[u ℓ-1 u L ] = C 0,0 (x, x) + ℓ-1,L i,j=1 C i,j (x, x) = C 0,0 (x, x) + ℓ-1,L i,j=1 C i,j (x, x) = E[ū ℓ-1 ūL ].

Formula formula_210: S L (x, x) -S L (x, x) = 1 2 ∥x -x∥ 2 + 1 2 E g L (x) -g L (x) 2 ,

Formula formula_211: 1. 0 < S * (x, x) = S * (x, x) < ∞ 2. S * (x, x) ≥ S * (x, x

Formula formula_212: Proof. Observe that S * (x, x) = x T x + E [g(x)g(x)] ,

Formula formula_213: L→∞ g L (x) for g L (x) = L -1 L ℓ=1 ϕ(u ℓ )

Formula formula_214: S * (x, x) -S * (x, x) = 1 2 ∥x -x∥ 2 + 1 2 E |g(x) -g(x)| 2 ,

Formula formula_215: Σ * (x, x) = E (u,ū)∼S * (x,x) [ϕ(u)ϕ(ū)] = ∞ n=0 a 2 n [S * (x, x)/S 0 ] n ,

Formula formula_216: i } N i=1 from S d-1 and nonzero c ∈ R N . Observe that N i,j=1 c i c j Σ * (x i , x j ) = ∞ n=0 a 2 n S -n 0 N i,j=1 c i c j [S * (x i , x j )] n = ∞ n=0 a 2 n S -n 0 N i,j=1 c i c j x T i x j + Eg(x i )g(x j ) n ,

Formula formula_217: N i,j=1 c i c j x T i x j + Eg(x i )g(x j ) n =c T (XX T + Eg(X)g(X) T ) ⊙n c ≥c T (XX T ) ⊙n c = N i,j=1 c i c j x T i x j n ,

Formula formula_218: N i,j=1 c i c j Σ * (x i , x j ) ≥ ∞ n=0 a 2 n S -n 0 N i,j=1 c i c j x T i x j n = N i,j=1 c i c j ∞ n=0 a 2 n (x T i x j /S 0 ) n = N i,j=1 c i c j E (u,ū)∼x T i xj /S0 [ψ(u)ψ(ū)] = N i,j=1 c i c j E (u,ū)∼x T i xj [ψ(u/ S 0 )ψ(ū/ S 0 )] = N i,j=1 c i c j E (u,ū)∼x T i xj [ϕ(u)ϕ(ū)] = N i,j=1 c i c j Σ 1 (x i , x j ),

Formula formula_219: L(θ k ) ≤ 1 - ηλ 0 16 k L(θ 0 ),(73)

Formula formula_220: H ∞ ∈ R N ×N defined as H ∞ ij = K ∞ (x i , x j ).

Formula formula_221: ∥v k -v 0 ∥, ∥W k -W 0 ∥, ∥U k -U 0 ∥ ≤ C∥X∥∥u 0 -y∥/λ 0 ,and

Formula formula_222: ∥u k -y∥ ≤ 1 - ηλ 0 16 k ∥u 0 -y∥, for all n ≥ max n 0 , n 1 , C 0 N 3 log(N/δ)/λ 3 0 .

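Formula formula_223: $$L(\theta) := \sum_{i=1}^N\frac12\big(f_\theta(x_i) - y_i\big)^2. \tag{74}$$

Formula formula_225: $$\frac{\partial L(\theta)}{\partial v} = \sum_{i=1}^N\frac{\sigma_v}{\sqrt n}\phi(h_T(x_i))\big(f_\theta(x_i) - y_i\big), \tag{75}$$

Formula formula_226: $$\frac{\partial L(\theta)}{\partial W} = \sum_{i=1}^N\left[\int_0^T\frac{\sigma_w}{\sqrt n}\phi(h_t(x_i))\otimes\lambda_t(x_i)\,dt\right]\big(f_\theta(x_i) - y_i\big), \tag{76}$$

Formula formula_227: $$\frac{\partial L(\theta)}{\partial U} = \sum_{i=1}^N\frac{\sigma_u}{\sqrt d}\,[x_i\otimes\lambda_0(x_i)]\,\big(f_\theta(x_i) - y_i\big). \tag{77}$$

Formula formula_228: $$\theta_{k+1} = \theta_k - \eta\frac{\partial L(\theta_k)}{\partial\theta}. \tag{78}$$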

Formula formula_229: 1. ∥v i ∥, ∥W i ∥, ∥U i ∥ ≤ C √ n, 2. ∥u i -y∥ ≤ (1 -ηα 2 0 ) i ∥u 0 -y∥,

Formula formula_230: α 0 := σ min σv √ n Φ 0 .

Formula formula_231: σ v = 1, σ w = σ, σ u / √ d = 1 and L 1 = L 2 = 1.

Formula formula_232: ∥ ∂f θ ∂v ∥ = ∥ 1 √ n ϕ(h T )∥ ≤ 1 √ n ∥U ∥∥x∥e σT ∥W ∥/ √ n .

Formula formula_233: ∥ ∂f θ ∂W ∥ ≤ σ √ n T 0 ∥ϕ(h t )∥∥λ t ∥dt ≤ σ √ n T 0 ∥U ∥∥x∥e σt∥W ∥/ √ n • ∥v∥ √ n e σ(T -t)∥W ∥/ √ n dt =(σT ) ∥U ∥ √ n ∥v∥ √ n ∥x∥e σT ∥W ∥/ √ n .

Formula formula_234: ∥ ∂f θ ∂U ∥ ≤ ∥x∥∥λ 0 ∥ ≤ ∥x∥ • ∥v∥ √ n exp σT ∥W ∥/ √ n

Formula formula_235: ∥ ∂f θ ∂v ∥ ≤ Ce CσT ∥x∥,(79)

Formula formula_236: ∥ ∂f θ ∂W ∥ ≤ (σT )Ce CσT ∥x∥,(80)

Formula formula_237: ∥ ∂f θ ∂U ∥ ≤ Ce CσT ∥x∥.(81)

Formula formula_238: ∥v k+1 -v 0 ∥ ≤η k i=0 ∥ ∂L(θ i ) ∂v ∥ ≤η k i=0 Ce CσT ∥X∥∥u i -y∥ ≤ηCe CσT ∥X∥ k i=0 (1 -ηα 2 0 ) i ∥u 0 -y∥ ≤Ce CσT ∥X∥∥u 0 -y∥/α 2 0

Formula formula_239: Ce CσT ∥X∥∥u 0 -y∥/α 2 0 ≤ C √ n.(82)

Formula formula_240: ∥v k+1 ∥ ≤ ∥v k+1 -v 0 ∥ + ∥v 0 ∥ ≤ C √ n. Similarly, we have ∥W k+1 -W 0 ∥ ≤η k i=0 ∥ ∂L(θ i ) ∂W ∥ ≤η k i=0 (σT )Ce CσT ∥X∥∥u i -y∥ ≤η(σT )Ce CσT ∥X∥ k i=0 (1 -ηα 2 0 )∥u 0 -y∥ ≤(σT )Ce CσT ∥X∥∥u 0 -y∥/α 2 0 . Then we need to ensure (σT )Ce CσT ∥X∥∥u 0 -y∥/α 2 0 ≤ C √ n.

Formula formula_241: ∥W k+1 ∥ ≤ ∥W k+1 -W 0 ∥ + ∥W 0 ∥ ≤ C √ n. Observe that ∥U k+1 -U 0 ∥ ≤η k i=0 ∥ ∂L(θ i ) ∂U ∥ ≤η k i=0 Ce CσT ∥X∥∥u i -y∥ ≤ηCe CσT ∥X∥ k i=0 (1 -ηα 2 0 ) i ∥u 0 -y∥ ≤Ce CσT ∥X∥∥u 0 -y∥/α 2 0 . Hence, we obtain ∥U k+1 ∥ ≤ ∥U k+1 -U 0 ∥ + ∥U 0 ∥ ≤ C √ n.

Formula formula_242: u k+1 -y =u k+1 -u k + (u k -y) = ∂ ũ ∂θ ⊤ (θ k+1 -θ k ) + (u k -y) = ∂ ũ ∂θ ⊤ -η ∂u k ∂θ (u k -y) + (u k -y) = I -η ∂ ũ ∂θ ⊤ ∂u k ∂θ (u k -y) = I -η ∂u k ∂θ ⊤ ∂u k ∂θ (u k -y) + η ∂u k ∂θ - ∂ ũ ∂θ ⊤ ∂u k ∂θ (u k -y)

Formula formula_243: ∥ ∂f ∂v - ∂ f ∂v ∥ =∥ 1 √ n ϕ(h T ) - 1 √ n ϕ( hT )∥ ≤ 1 √ n ∥h T -hT ∥ ≤ C √ n ∥θ -θ∥e CσT ∥x∥

Formula formula_244: ∥ ∂f ∂W - ∂ f ∂W ∥ ≤ σ √ n ∥ T 0 ϕ(h t ) ⊗ λ t -ϕ( ht ) ⊗ λt dt∥ ≤ σ √ n T 0 ∥h t -ht ∥∥λ t ∥ + ∥ ht ∥∥λ t -λt ∥ dt ≤C σ √ n T 0 ∥θ -θ∥e Cσt ∥x∥ • e Cσ(T -t) dt ≤(σT ) C √ n ∥θ -θ∥e CσT ∥x∥.

Formula formula_245: ∥ ∂f ∂U - ∂ f ∂U ∥ ≤ ∥x∥∥λ 0 -λ0 ∥ ≤ C √ n ∥θ -θ∥e CσT ∥x∥.

Formula formula_246: ∥ ∂f ∂θ - ∂ f ∂θ ∥ = ∥ ∂f ∂v - ∂ f ∂v ∥ + ∥ ∂f ∂W - ∂ f ∂W ∥ + ∥ ∂f ∂U - ∂ f ∂U ∥ ≤ (σT ) C √ n ∥θ -θ∥e CσT ∥x∥. Then ∥ ∂u k ∂θ - ∂ ũ ∂θ ∥ ≤ (σT ) C √ n ∥θ k -θ∥e CσT ∥X∥ ≤ (σT ) C √ n ∥θ k -θ k+1 ∥e CσT ∥X∥,

Formula formula_247: ∥θ k+1 -θ k ∥ = η∥ ∂L(θ k ) ∂θ ∥ = η∥ ∂u k ∂θ ⊤ (u k -y)∥ ≤ η(σT )Ce CσT ∥X∥∥u k -y∥.

Formula formula_248: ∥ ∂u k ∂θ - ∂ ũ ∂θ ∥ ≤ η(σT ) 2 C √ n e CσT ∥X∥ 2 ∥u k -y∥,and

Formula formula_249: ∥ ∂u k ∂θ - ∂u 0 ∂θ ∥ ≤(σT ) C √ n ∥θ k -θ 0 ∥e CσT ∥X∥ ≤(σT ) C √ n e CσT ∥X∥ k-1 i=0 ∥θ i+1 -θ i ∥ ≤η(σT ) 2 C √ n e CσT ∥X∥ 2 k-1 i=0 ∥u i -y∥ ≤η(σT ) 2 C √ n e CσT ∥X∥ 2 k-1 i=0 (1 -ηα 2 0 ) i ∥u 0 -y∥ ≤(σT ) 2 C √ n e CσT ∥X∥ 2 ∥u 0 -y∥/α 2 0 ≤α 0 /2,

Formula formula_250: $$\sqrt n \ge C(\sigma T)^2 e^{C\sigma T}\|X\|^2\|u_0 - y\|/\alpha_0^3. \tag{84}$$

Formula formula_252: σ min ∂u k ∂θ ≥ σ min ∂u 0 ∂θ -∥ ∂u k ∂θ - ∂u 0 ∂θ ∥ ≥ α 0 /2.

Formula formula_253: λ min ∂u k ∂θ T ∂u k ∂θ ≥ α 2 0 /4.

Formula formula_254: ∥u k+1 -y∥ ≤ 1 -ηα 2 0 /4 ∥u k -y∥ + η 2 (σT ) 3 C √ n e CσT ∥X∥ 3 ∥u k -y∥ 2 ≤ 1 -ηα 2 0 /4 + η 2 (σT ) 3 C √ n e CσT ∥X∥ 3 ∥u 0 -y∥ ∥u k -y∥ = 1 -η α 2 0 /4 -η(σT ) 3 C √ n e CσT ∥X∥ 3 ∥u 0 -y∥ ∥u k -y∥

Formula formula_255: ∥v k -v 0 ∥, ∥W k -W 0 ∥, ∥U k -U 0 ∥ ≤ C∥X∥∥u 0 -y∥/λ 0 ,(85)

Formula formula_256: ∥u k -y∥ ≤ 1 - ηλ 0 8 k ∥u 0 -y∥,(86)

Formula formula_257: $$\|h_t\| \le \|U\|\|x\|\exp\left(\frac{\sigma t}{\sqrt n}\|W\|\right), \tag{87}$$

Formula formula_259: $$\|\lambda_t\| \le \frac{\|v\|}{\sqrt n}\exp\left(\frac{\sigma(T-t)}{\sqrt n}\|W\|\right), \tag{88}$$

Formula formula_261: for all $t\in[0,T]$

Formula formula_262: h t = h 0 + t 0 σ √ n W ϕ(h s )ds

Formula formula_263: ∥h t ∥ ≤ ∥h 0 ∥ + σ √ n ∥W ∥ t 0 ∥h s ∥ds

Formula formula_264: ∥h t ∥ ≤ ∥U ∥∥x∥ exp σt √ n ∥W ∥ , ∀t ∈ [0, T ].(89)

Formula formula_265: λ t = λ T + T t - σ √ n diag[ϕ ′ (h t )]W ⊤ λ s ds implies ∥λ t ∥ ≤ ∥λ T ∥ + σ √ n L 1 ∥W ∥ T t ∥λ s ∥ds.

Formula formula_266: ∥λ t ∥ ≤∥λ T ∥ exp T t σ∥W ∥/ √ nds ≤∥λ T ∥ exp σ∥W ∥/ √ n(T -t)

Formula formula_267: By λ T = 1 √ n diag[ϕ ′ (h T )]v

Formula formula_268: ∥h t -ht ∥ ≤∥θ -θ∥ ∥U ∥ ∥W ∥ e σt(∥W ∥+∥ W ∥)/ √ n ∥x∥ (90) ∥λ t -λt ∥ ≤∥θ -θ∥ ∥v∥ ∥W ∥ e σ(T -t)(∥W ∥+∥ W ∥)/ √ n / √ n(91)

Formula formula_269: h t -ht = (h 0 -h0 ) + σ √ n t 0 W ϕ(h s ) -W ϕ( hs ) ds

Formula formula_270: ∥h t -ht ∥ ≤∥h 0 -h0 ∥ + σ √ n t 0 ∥W -W ∥∥h s ∥ + ∥ W ∥∥h s -hs ∥ ds ≤∥h 0 -h0 ∥ + σ √ n ∥W -W ∥ t 0 ∥U x∥ exp σs∥W ∥/ √ n ds + σ √ n ∥ W ∥ t 0 ∥h s -hs ∥ds

Formula formula_271: σ √ n ∥U x∥∥W -W ∥ t 0 exp σs∥W ∥/ √ n ds = σ √ n ∥U x∥∥W -W ∥ • σ √ n ∥W ∥ -1 e σt∥W ∥/ √ n -1 = ∥U ∥ ∥W ∥ ∥W -W ∥ e σt∥W ∥/ √ n -1 ∥x∥.

Formula formula_272: ∥h t -ht ∥ ≤ ∥h 0 -h0 ∥ + ∥W -W ∥ ∥U ∥ ∥W ∥ e σ∥W ∥t/ √ n -1 ∥x∥ e σ∥ W ∥t/ √ n ≤ ∥U -Ū ∥ + ∥W -W ∥ ∥U ∥ ∥W ∥ e σt(∥W ∥+∥ W )∥/ √ n ∥x∥.

Formula formula_273: λ t -λt = (λ T -λT ) + σ √ n T t diag[ϕ ′ (h s )]W ⊤ λ s -diag[ϕ ′ ( hs )] W ⊤ λs ds.

Formula formula_274: ∥λ t -λt ∥ ≤ ∥λ T -λT ∥ + σ √ n T t ∥W -W ∥∥λ s ∥ + ∥ W ∥∥λ s -λs ∥ ds

Formula formula_275: σ √ n ∥W -W ∥ ∥v∥ √ n T t exp σ(T -s) √ n ∥W ∥ ds ≤ σ √ n ∥W -W ∥ ∥v∥ √ n σ √ n ∥W ∥ -1 e σ(T -t)∥W ∥/ √ n -1 = 1 √ n ∥W -W ∥ ∥v∥ ∥W ∥ e σ(T -t)∥W ∥/ √ n -1 .

Formula formula_276: ∥λ t -λt ∥ ≤ ∥λ T -λT ∥ + 1 √ n ∥W -W ∥ ∥v∥ ∥W ∥ e σ(T -t)∥W ∥/ √ n -1 e σ(T -t)∥ W ∥/ √ n ≤ 1 √ n ∥v -v∥ + ∥W -W ∥ ∥v∥ ∥W ∥ e σ(T -t)(∥W ∥+∥ W ∥)/ √ n

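Formula formula_277: $$\|u\| \le \sigma\sqrt{2N\log(N/\delta)}, \tag{92}$$

Formula formula_278: $\sigma^2 := \Sigma^*(x,x)$ for $x\in S^{d-1}$. Proof. Fix $x$, denote $u := f_\theta(x) = v^\top\phi(h_T(x))/\sqrt n$.

Formula formula_279: $$|P(u\ge\varepsilon) - P(z\ge\varepsilon)| \le \delta/2,$$

Formula formula_280: $$P(u\ge\varepsilon) \le \delta/2 + P(z\ge\varepsilon) \le \delta/2 + e^{-\varepsilon^2/2\sigma^2} \le \delta,$$

Formula formula_281: $$P(|u|\ge\varepsilon) \le \delta.$$

Formula formula_282: $$P(\|u\|\ge\varepsilon_0) = P(\|u\|^2\ge\varepsilon_0^2) = P\Big(\sum_{i=1}^N|u_i|^2\ge\varepsilon_0^2\Big) \le \sum_{i=1}^N P\big(|u_i|^2\ge\varepsilon_0^2/N\big) = \sum_{i=1}^N P\big(|u_i|\ge\varepsilon_0/\sqrt N\big) \le \delta,$$

Formula formula_283: $P\big(\sum_{i=1}^N x_i\ge\varepsilon\big) \le \sum_{i=1}^N P(x_i\ge\varepsilon/N)$ and $\varepsilon_0 := \sigma\sqrt{2N\log(N/\delta)}$.

Formula formula_284: $$\dot h_t = \frac{\sigma_w}{\sqrt n}W f(h_t, t), \quad \forall t\in[0,T],$$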
