SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: theory of deep learning, gradient flow, o-minimal structures, GELU function, optimization, loss landscapes, generalized critical values, divergence
TL;DR: We prove divergence results for gradient flows of deep neural networks with analytic activation functions and polynomial target functions.
Abstract: We study gradient flows for loss landscapes of fully connected feedforward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus, or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently large architectures and data sets, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result: any gradient flow with sufficiently good initialization diverges to infinity. Our proof relies heavily on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to more realistic scenarios, where we observe analogous behavior.
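The sketch below is a minimal, hedged illustration (not the authors' code) of the kind of numerical experiment the abstract alludes to: a small fully connected GELU network trained by gradient descent, i.e. an explicit Euler discretization of the gradient flow, on an assumed toy polynomial target $f(x)=x^2$. Under the paper's result, for a sufficiently good initialization one expects the loss to approach zero while the parameter norm keeps growing. All hyperparameters (width, learning rate, step count) are illustrative assumptions.

```python
# Minimal sketch of a divergence experiment, assuming PyTorch and a toy target f(x) = x^2.
# Not the paper's experimental setup; it only illustrates the qualitative phenomenon:
# loss -> optimal value while the parameter norm grows.
import torch

torch.manual_seed(0)

# Small data set sampled from a polynomial target (assumed example).
x = torch.linspace(-1.0, 1.0, 32).unsqueeze(1)
y = x ** 2

# Fully connected feedforward network with GELU activation.
model = torch.nn.Sequential(
    torch.nn.Linear(1, 16),
    torch.nn.GELU(),
    torch.nn.Linear(16, 1),
)

# Plain gradient descent as an explicit Euler step of the gradient flow.
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(10001):
    opt.zero_grad()
    loss = torch.mean((model(x) - y) ** 2)
    loss.backward()
    opt.step()
    if step % 2000 == 0:
        # Track the loss and the Euclidean norm of all parameters.
        param_norm = torch.sqrt(sum((p ** 2).sum() for p in model.parameters()))
        print(f"step {step:6d}  loss {loss.item():.3e}  ||theta|| {param_norm.item():.2f}")
```

The growth of $\|\theta\|$ can be slow and depends on initialization and step size; the point of the sketch is only that the loss decreases toward its optimal value while the parameters do not settle at a finite critical point.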
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 12138