['6,10c6,15', '< In recent years, deep neural networks have achieved remarkable success across a wide range of applications. Among these advancements, Neural Ordinary Differential Equations (ODEs) (Chen et al., 2018b) stand out due to their continuous nature and parameter efficiency through shared parameters. Unlike conventional neural networks with discrete layers, Neural ODEs model the evolution of hidden states as a continuous-time differential equation, allowing them to better capture dynamic systems. This parameter-sharing mechanism ensures consistent dynamics throughout the continuous transformation and reduces the number of parameters, improving both memory efficiency and computational complexity. These unique properties make Neural ODEs particularly effective not only for traditional machine learning tasks like image classification (Chen et al., 2018b) and natural language processing (Rubanova et al., 2019), but also for more complex tasks involving continuous processes, such as time series analysis (Kidger et al., 2020), reinforcement learning (Du et al., 2020), and diffusion models (Song et al., 2020). However, while these features offer flexibility and efficiency, they also introduce significant challenges during training, particularly in gradient computation and convergence analysis.', '< One of the key challenges in training Neural ODEs is accurately computing gradients. Unlike traditional networks, where backpropagation can be computed through a discrete chain of layers, Neural ODEs require solving forward and backward ODEs using numerical solvers. These solvers introduce numerical errors, which can lead to inaccurate gradients and slow convergence or even suboptimal model performance (Rodriguez et al., 2022). Moreover, ensuring the well-posedness of ODE solutions during training is nontrivial. 
According to the Picard-Lindelöf Theorem, solutions may not always exist or may only exist locally, which can cause training divergence or significant numerical errors (Gholami et al., 2019;Ott et al., 2020;Sander et al., 2022). Even with advanced solvers (Zhuang et al., 2020a;b;Matsubara et al., 2021;Ko et al., 2023), it remains an open problem whether simple first-order methods, such as stochastic gradient descent (SGD), can reliably train Neural ODEs to convergence. While discretizing Neural ODEs as finite-depth networks offers a potential solution, it results in a deeper computation graph (Zhuang et al., 2020a;b), raising questions about whether the gradients computed in this manner truly match those of the continuous model.', "< Another essential challenge lies in analyzing the training dynamics of Neural ODEs. The optimization problem in training neural networks is inherently nonconvex, making theoretical analysis difficult. Recent work by Jacot et al. (2018) has shown that the training dynamics of overparameterized networks can be understood through the lens of the Neural Tangent Kernel (NTK), which converges to a deterministic limit as network width increases. This convergence has enabled researchers to establish global convergence guarantees for gradient-based methods in overparameterized regimes, provided the NTK remains strictly positive definite (SPD) (Du et al., 2019a;Allen-Zhu et al., 2019;Nguyen, 2021;Gao et al., 2021). The analysis of the NTK's strict positive definiteness began with Daniely et al. (2016), who introduced the concept of dual activation for two-layer networks, later extended to deeper, finite networks (Jacot et al., 2018;Du et al., 2019a). However, these results are limited to networks with discrete layers, raising the question of whether the same properties hold for continuous models like Neural ODEs.", '< In this paper, we address these challenges by exploring the impact of activation functions on training Neural ODEs. 
We show that activation function properties, specifically smoothness and nonlinearity, play critical roles in determining the well-posedness of ODE solutions and the spectral properties of the NTK. Through our analysis, we demonstrate that smooth activation functions lead to globally unique solutions for both forward and backward ODEs, ensuring the stability of the training process. Additionally, we extend existing results on the NTK from discrete-layered neural networks to continuous models, demonstrating that the NTK for Neural ODEs is well-defined. Importantly, we find that a higher degree of nonlinearity in the activation function not only helps maintain the SPD property of the limiting NTK, but also practically speeds up Neural ODE convergence.', '< 1. We investigate the significance of the smoothness of activation functions for the well-posedness of forward and backward ODEs in Neural ODEs. Using random matrix theory, we demonstrate the existence of globally unique solutions. Additionally, we show that no additional regularity is needed if forward and backward ODEs are combined in a weakly coupled ODE system. 2. We propose a new mathematical framework for studying continuous models from the approximation theory perspective. By using a sequence of finite-depth neural networks to approximate Neural ODEs, we show that key properties like activation and gradient propagation are preserved as depth approaches infinity. This allows us to apply the Moore-Osgood theorem from functional analysis to prove that the NTK of Neural ODEs is well-defined. 3. Unfortunately, the SPD property of the NTK may not hold at infinite depth, even though we can show that every finite-depth approximation satisfies it. To address this, we conduct a fine-grained analysis and derive an integral form for the limiting NTK of Neural ODEs. This form reveals that the NTK remains SPD if the activation function is non-polynomial. 
This integral representation provides valuable insights into continuous models and may inspire further research. 4. We conduct a series of numerical experiments to support our theoretical findings. Beyond validating our analysis, these experiments also provide practical guidelines for training Neural ODEs. We show that activation function smoothness and nonlinearity accelerate convergence and improve performance (see Figure 78). Conversely, improper ODE scaling leads to damping from accumulated numerical errors (see Figure 3), while adaptive solvers struggle with efficiency in large-scale Neural ODEs (see Figure 14), causing instability and high computational overhead.', '---', '> Deep neural networks have revolutionized numerous fields, achieving unprecedented success in diverse applications. Within this landscape of innovation, Neural Ordinary Differential Equations (Neural ODEs) (Chen et al., 2018b) represent a paradigm shift, distinguishing themselves through their intrinsically continuous nature and remarkable parameter efficiency. Unlike traditional neural networks, which are characterized by discrete layers, Neural ODEs elegantly model the evolution of hidden states as a continuous-time differential equation. This allows for a more faithful representation of dynamic systems and offers inherent advantages in terms of memory efficiency and computational complexity due to their parameter-sharing mechanism. These unique attributes have propelled Neural ODEs to prominence, making them exceptionally effective not only for conventional machine learning tasks such as image classification (Chen et al., 2018b) and natural language processing (Rubanova et al., 2019), but also for sophisticated applications involving continuous processes, including time series analysis (Kidger et al., 2020), reinforcement learning (Du et al., 2020), and diffusion models (Song et al., 2020). 
Despite these compelling advantages, the training of Neural ODEs introduces significant theoretical and practical challenges, particularly in the accurate computation of gradients and the rigorous analysis of their convergence properties.', '> A primary challenge in the training of Neural ODEs stems from the accurate computation of gradients. In contrast to conventional neural networks, where gradients are efficiently propagated through a discrete chain of layers via backpropagation, Neural ODEs necessitate the numerical solution of both forward and backward (adjoint) ODEs. This reliance on numerical solvers inevitably introduces inherent numerical errors, which can significantly compromise gradient accuracy, leading to sluggish convergence, or even suboptimal model performance (Rodriguez et al., 2022). Furthermore, ensuring the well-posedness of ODE solutions throughout the training trajectory is a non-trivial task. As stipulated by the Picard-Lindelöf Theorem, solutions to ODEs may not always exist globally or may only be locally unique, a condition that can precipitate training divergence or substantial numerical inaccuracies (Gholami et al., 2019;Ott et al., 2020;Sander et al., 2022). Even with the advent of sophisticated numerical solvers (Zhuang et al., 2020a;b;Matsubara et al., 2021;Ko et al., 2023), it remains an open and critical question whether fundamental first-order optimization methods, such as stochastic gradient descent (SGD), can reliably achieve convergence when applied to Neural ODEs. While the strategy of discretizing Neural ODEs into finite-depth networks offers a potential avenue for gradient computation, this approach typically results in an even deeper computational graph (Zhuang et al., 2020a;b), thereby raising fundamental questions regarding the fidelity of these discretized gradients to those of the true continuous model.', "> A second, equally critical, challenge resides in rigorously analyzing the training dynamics of Neural ODEs. 
The optimization landscape for training deep neural networks is notoriously non-convex, which fundamentally complicates theoretical analysis. A significant theoretical advancement was made by Jacot et al. (2018), who demonstrated that the intricate training dynamics of overparameterized networks can be effectively characterized through the framework of the Neural Tangent Kernel (NTK). The NTK is shown to converge to a deterministic limit as the network width approaches infinity. This seminal convergence result has empowered researchers to establish robust global convergence guarantees for gradient-based optimization methods in overparameterized regimes, contingent upon the NTK maintaining its strictly positive definite (SPD) property throughout training (Du et al., 2019a;Allen-Zhu et al., 2019;Nguyen, 2021;Gao et al., 2021). The foundational analysis of the NTK's strict positive definiteness was pioneered by Daniely et al. (2016) with the introduction of dual activation for two-layer networks, subsequently extended to deeper, finite-layered architectures (Jacot et al., 2018;Du et al., 2019a). Crucially, these profound theoretical insights are predominantly confined to networks with discrete layers, leaving a significant void: it remains an open and unresolved question whether these essential properties, particularly the SPD of the NTK, extend to continuous-depth models such as Neural ODEs.", '> In this paper, we directly tackle these fundamental challenges by meticulously investigating the profound impact of activation functions on the training dynamics of Neural ODEs. Our comprehensive analysis unequivocally demonstrates that specific properties of activation functions—namely, smoothness and nonlinearity—are paramount in governing both the well-posedness of ODE solutions and the critical spectral properties of the Neural Tangent Kernel (NTK). 
Through rigorous theoretical derivations, we establish that smooth activation functions are instrumental in guaranteeing globally unique solutions for both the forward and backward (adjoint) ODEs, thereby ensuring the inherent stability and reliability of the entire training process. Crucially, we successfully extend existing foundational results on the NTK, which were previously confined to discrete-layered neural networks, to the realm of continuous models, proving that the NTK for Neural ODEs is rigorously well-defined. Furthermore, our findings highlight that a higher degree of nonlinearity in the activation function is not only essential for preserving the strictly positive definite (SPD) property of the limiting NTK, but also empirically translates into significantly faster convergence rates for Neural ODEs.', '> Our key contributions are summarized as follows:', '> 1.  **Well-Posedness and Gradient Equivalence:** We rigorously establish that the smoothness of activation functions is crucial for the well-posedness of both forward and backward ODEs in Neural ODEs, guaranteeing globally unique solutions. We further demonstrate that, under sufficient smoothness, the "optimize-then-discretize" and "discretize-then-optimize" gradient computation methods yield equivalent gradients, a critical insight for stable training.', '> 2.  **NTK for Continuous Models:** We introduce a novel mathematical framework that extends the Neural Tangent Kernel (NTK) theory from discrete-layered networks to continuous-depth Neural ODEs. By approximating Neural ODEs with sequences of finite-depth networks and leveraging the Moore-Osgood theorem, we prove that the NTK of Neural ODEs is rigorously well-defined in the infinite-width limit.', '> 3.  **Strict Positive Definiteness (SPD) of NTK:** We provide a fine-grained analysis of the limiting NTK for Neural ODEs, deriving its integral form. 
This integral representation reveals that the NTK retains its strictly positive definite (SPD) property if the activation function is non-polynomial, a crucial condition for global convergence. This result addresses a long-standing challenge regarding NTK properties at infinite depth.', '> 4.  **Global Convergence Guarantees:** Building upon the established well-posedness and SPD properties of the NTK, we provide, for the first time, global convergence guarantees for gradient descent in overparameterized Neural ODEs. This theoretical breakthrough bridges the gap between discrete and continuous-depth models in the context of global optimization.', '> 5.  **Empirical Validation and Practical Guidelines:** We conduct extensive numerical experiments that not only validate our theoretical findings but also provide practical insights. Our experiments demonstrate that smooth and nonlinear activation functions accelerate convergence and improve performance. Furthermore, we offer guidelines on proper ODE scaling to prevent damping and highlight the efficiency challenges of adaptive solvers in large-scale Neural ODEs.', '16c21', '< In this paper, we consider a simple Neural ODE f (x; θ)1 defined as follows', '---', '> In this paper, we rigorously analyze a canonical form of Neural ODE, denoted as f (x; θ), which is formally defined as:', '18c23', '< where h t ∈ R n is the hidden state that satisfies the following ordinary differential equation', '---', '> where h t ∈ R n represents the hidden state whose evolution is governed by the following ordinary differential equation:', '20c25', '< where ϕ is the activation function2 , x ∈ R d is input, U ∈ R n×d , W ∈ R n×n , and v ∈ R n are learnable parameters. 
These parameters, denoted by θ := vec(U , W , v), are randomly initialized (Glorot & Bengio, 2010;He et al., 2015) from standard Gaussian distribution:', '---', '> Here, ϕ is the chosen activation function2 , x ∈ R d is the input vector, and U ∈ R n×d , W ∈ R n×n , and v ∈ R n constitute the set of learnable parameters. These parameters, collectively denoted by θ := vec(U , W , v), are initialized randomly (Glorot & Bengio, 2010;He et al., 2015) from a standard Gaussian distribution:', '22c27', '< ∼ N (0, 1), (3) with variance hyperparameters σ u , σ w , σ v > 0. Due to the continuous nature of Neural ODEs, computing gradients through standard backpropagation is not feasible. Instead, we use the adjoint method (Chen et al., 2018b) to compute gradients by solving the backward ODE:', '---', '> ∼ N (0, 1), (3) with associated variance hyperparameters σ u , σ w , σ v > 0. A critical aspect of Neural ODEs is that their continuous nature precludes the use of standard backpropagation for gradient computation. Instead, we employ the adjoint method (Chen et al., 2018b), which involves solving a backward (adjoint) ODE to compute the necessary gradients. The adjoint state λ t follows the differential equation:', '24c29', '< √ n, and λt = -σ w diag(ϕ ′ (h t ))W T λ t / √ n, (4) where λ t is the adjoint state. When both forward and backward ODEs are well-posed, we can compute the gradients of f θ with respect to (w.r.t.) the parameters θ as follows', '---', '> √ n, and λt = -σ w diag(ϕ ′ (h t ))W T λ t / √ n, (4) where λ t is the adjoint state. Provided that both the forward and backward ODEs are well-posed, the gradients of the model output f θ with respect to the parameters θ can be accurately computed as:', '27c32', '< Further details on these derivations are provided in Appendix B. In Section 3, we demonstrate that if ϕ is Lipschitz continuous, the forward and backward ODEs have globally unique solutions. 
Additionally, in Section 5, we prove this well-posedness holds throughout the entire training process.', '---', '> Further comprehensive details on these derivations are elucidated in Appendix B. A key theoretical result, established in Section 3, demonstrates that if the activation function ϕ is Lipschitz continuous, both the forward and backward ODEs possess globally unique solutions. This crucial well-posedness property is further proven in Section 5 to hold consistently throughout the entire training process.', '30c35', '< Given a training dataset {(x i , y i )} N i=1 , the objective is to learn a parameter θ that minimizes the empirical loss:', '---', '> Given a training dataset {(x i , y i )} N i=1 , the primary objective is to learn a parameter vector θ that minimizes the empirical loss function:', '32,35c37,40', '< where u = (u 1 , ... , u N ) is the prediction vector with u i = f (x i ; θ), and y = (y 1 , ... , y N ) is the output vector. Gradient descent with a learning rate η > 0 is used to minimize the loss: θ k+1 = θ k - η∇ θ L(θ k ).', '< (7) Following (Du et al., 2019a) and (Jacot et al., 2018), under some required conditions, the evolution of the prediction vector u k can be approximated as follows:', '< u k+1 - y ≈ (I - ηH k )(u k - y), (8)', '< where H k ∈ R N ×N is a Gram matrix defined through the NTK (Jacot et al., 2018):', '---', '> where u = (u 1 , ... , u N ) is the prediction vector with u i = f (x i ; θ), and y = (y 1 , ... , y N ) is the corresponding output vector. 
The minimization of this loss is typically achieved using gradient descent with a learning rate η > 0: θ k+1 = θ k - η∇ θ L(θ k ).', '> (7) As established by prior works (Du et al., 2019a;Jacot et al., 2018), under specific conditions, the evolution of the prediction vector u k during training can be approximated as:', '> u k+1 - y ≈ (I - ηH k )(u k - y), (8)', '> where H k ∈ R N ×N is a Gram matrix constructed from the Neural Tangent Kernel (NTK) (Jacot et al., 2018), defined as:', '37,38c42,43', '< The training dynamics in Eq. (8) are governed by the spectral property of the NTK Gram matrix H k . If there exists a strictly positive constant λ 0 > 0 such that λ min (H k ) ≥ λ 0 for all k, then the residual u k - y decreases to zero exponentially, provided the learning rate η > 0 is sufficiently small (Allen-Zhu et al., 2019;Nguyen, 2021). As the parameters θ k are updated during training, the NTK K θ changes over time, making its spectral analysis challenging. Fortunately, previous research (Yang, 2020;Jacot et al., 2018) shows that the time-varying K θ converges to a deterministic limiting NTK K ∞ as the network width n → ∞. By leveraging this result and the concept of dual activation (Daniely et al., 2016), we can study the spectral properties of K θ during training through K ∞ using perturbation theory.', '< However, these prior results apply only to finite-depth neural networks, while Neural ODEs are infinite-depth networks due to their continuous nature. Moreover, as prior analyses are based on induction techniques, there is no guarantee that these essential properties will also hold as depth tends to infinity. In Section 4.1 and Section 4.2, we introduce a new framework to study Neural ODEs as infinite-depth networks. We demonstrate that the smoothness of ϕ is crucial to ensure these essential properties hold in Neural ODEs. 
Additionally, to study the spectral properties of the limiting NTK K ∞ , we provide a fine-grained analysis by expressing the limiting NTK of Neural ODEs in an integral form, from which we conclude that the nonlinearity of the activation plays a critical role in ensuring the strict positive definiteness of K ∞ .', '---', "> The stability and convergence of the training dynamics described by Eq. (8) are fundamentally governed by the spectral properties of the NTK Gram matrix H k . Specifically, if there exists a strictly positive constant λ 0 > 0 such that λ min (H k ) ≥ λ 0 for all iterations k, then the residual error u k - y is guaranteed to decrease exponentially to zero, provided the learning rate η > 0 is chosen sufficiently small (Allen-Zhu et al., 2019;Nguyen, 2021). A significant challenge arises because the parameters θ k are continuously updated during training, causing the NTK K θ to vary over time, which complicates its spectral analysis. Fortunately, seminal research (Yang, 2020;Jacot et al., 2018) has demonstrated that the time-varying NTK K θ converges to a deterministic limiting NTK K ∞ as the network width n → ∞. This crucial result, coupled with the concept of dual activation (Daniely et al., 2016), enables the study of K θ 's spectral properties during training via K ∞ using perturbation theory.", '> However, these profound theoretical results are inherently limited to finite-depth neural networks. Neural ODEs, by their very definition, are continuous-depth models, effectively representing infinite-depth networks. Consequently, the inductive techniques employed in prior analyses do not directly guarantee that these essential NTK properties, particularly the strict positive definiteness (SPD), will persist as depth tends to infinity. In this paper, we bridge this critical gap. In Section 4.1 and Section 4.2, we introduce a novel mathematical framework specifically designed to analyze Neural ODEs as infinite-depth networks. 
We rigorously demonstrate that the smoothness of the activation function ϕ is indispensable for preserving these essential NTK properties in Neural ODEs. Furthermore, to precisely characterize the spectral properties of the limiting NTK K ∞ , we conduct a fine-grained analysis, deriving an integral form for the limiting NTK of Neural ODEs. This integral representation provides a crucial insight: the NTK remains strictly positive definite (SPD) if the activation function is non-polynomial. This finding underscores the critical role of activation function nonlinearity in ensuring the well-behaved spectral properties of K ∞ .', '1466d1470', '< ']
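The residual recursion in Eq. (8) can be illustrated with a small numerical sketch. This is not the paper's experiment or method: it assumes a fixed, synthetic symmetric positive definite Gram matrix `H` (in the paper, H k changes with k), and the names `residual_norms`, `H`, `r0`, and `eta` are hypothetical. Under that assumption, choosing η = 1/λ max (H) makes every eigenvalue of I - ηH lie in [0, 1 - ηλ min (H)], so the residual norm contracts by at least that factor per step.

```python
import numpy as np


def residual_norms(H, r0, eta, steps):
    """Iterate r_{k+1} = (I - eta * H) r_k and record ||r_k||_2 at every step."""
    r = r0.copy()
    norms = [np.linalg.norm(r)]
    for _ in range(steps):
        r = r - eta * (H @ r)  # same as (I - eta * H) @ r
        norms.append(np.linalg.norm(r))
    return np.array(norms)


# Synthetic stand-in for the NTK Gram matrix (assumption: fixed across steps).
rng = np.random.default_rng(0)
N = 5
A = rng.standard_normal((N, N))
H = A @ A.T + 0.5 * np.eye(N)  # symmetric, smallest eigenvalue about 0.5 or larger

eigs = np.linalg.eigvalsh(H)   # eigenvalues in ascending order
lam_min, lam_max = eigs[0], eigs[-1]
eta = 1.0 / lam_max            # small enough that I - eta * H is positive semidefinite

steps = 50
r0 = rng.standard_normal(N)    # stands in for the initial residual u_0 - y
norms = residual_norms(H, r0, eta, steps)

# Geometric envelope implied by lambda_min(H) >= lam_min > 0.
bound = norms[0] * (1.0 - eta * lam_min) ** np.arange(steps + 1)
```

Comparing `norms` against `bound` shows the exponential decay that the lower bound λ 0 on λ min (H k ) is meant to guarantee; the sketch says nothing about whether such a bound holds for an actual Neural ODE.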
