Infinitely Deep Residual Networks: Unveiling Wide Neural ODEs as Gaussian Processes

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Neural ODE, Gaussian Process, Neural Tangent Kernel, Neural Network and Gaussian Process Correspondence, Kernel Methods
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We show that wide Neural ODEs are Gaussian Processes with strictly positive definite NNGP kernels.
Abstract: While Neural Ordinary Differential Equations (Neural ODEs) have demonstrated practical numerical success, our theoretical understanding of them remains limited. Notably, we still lack convergence results and prediction performance estimates for Neural ODEs trained using gradient-based methods. Inspired by numerical analysis, one might investigate Neural ODEs by studying the limiting behavior of Residual Networks (ResNets) as the depth $\ell$ approaches infinity. However, a significant challenge arises from the prevalent use of shared parameters in Neural ODEs: the corresponding ResNets possess \textit{infinite depth} and \textit{shared weights} across all layers. This characteristic prevents the direct application of methods relying on Stochastic Differential Equations (SDEs) to ResNets. In this paper, we analyze Neural ODEs via an infinitely deep ResNet with shared weights, with an analysis rooted in asymptotic results from random matrix theory (RMT). As a result, we establish the Neural Network and Gaussian Process (NNGP) correspondence for Neural ODEs, regardless of whether the parameters are shared. Remarkably, the resulting Gaussian processes (GPs) exhibit distinct behaviors depending on the use of parameter sharing, setting them apart from other neural network architectures such as feed-forward, convolutional, and recurrent networks. Moreover, we prove that, in the presence of these divergent GPs, the NNGP kernels are strictly positive definite when non-polynomial activation functions are applied. These findings lay the foundation for studying the training and generalization of Neural ODEs. Additionally, we provide an efficient dynamic programming algorithm for computing the covariance matrix of given input data. Finally, we conduct a series of numerical experiments to support our theoretical findings.
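
The abstract mentions a dynamic programming algorithm for the NNGP covariance matrix; the paper's shared-weight construction is not reproduced here, so the sketch below only illustrates the general flavor of such a layerwise kernel recursion, under the simplifying assumption of a wide ResNet with *independent* layer weights, ReLU activation (arc-cosine dual kernel), and a $1/\sqrt{L}$ residual-branch scaling so the deep limit is non-degenerate. The function names (`relu_dual`, `resnet_nngp_kernel`) and the specific recursion are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def relu_dual(K):
    """Closed-form E[relu(u) relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]]),
    applied entrywise to a covariance matrix K (arc-cosine kernel of degree 1)."""
    diag = np.sqrt(np.clip(np.diag(K), 1e-12, None))
    norm = np.outer(diag, diag)
    cos_t = np.clip(K / norm, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def resnet_nngp_kernel(X, depth=1000):
    """Illustrative dynamic-programming recursion for the NNGP kernel of a wide
    residual network x^{l+1} = x^l + (1/sqrt(L)) * relu(W^l x^l) with independent
    layer weights: K^{l+1} = K^l + (1/L) * V_relu(K^l).
    NOTE: this is a sketch under assumptions stated above; the shared-weight
    (Neural ODE) kernel analyzed in the paper requires a different recursion."""
    scale = 1.0 / depth                 # (1/sqrt(L))^2 per-layer kernel increment
    K = X @ X.T / X.shape[1]            # input-layer covariance
    for _ in range(depth):
        K = K + scale * relu_dual(K)
    return K

# Usage: covariance matrix for a handful of random inputs.
X = np.random.randn(5, 10)
K = resnet_nngp_kernel(X, depth=1000)
print(K.shape)  # (5, 5)
```

As the depth grows, this recursion approaches the kernel ODE $\mathrm{d}K/\mathrm{d}t = V_\sigma(K)$ on $t \in [0, 1]$, which is why the per-layer increment is scaled by $1/L$ in the sketch.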
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1999