Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$ Parametrization
Abstract: Despite deep neural networks' powerful representation learning capabilities, a theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches such as the neural tangent kernel (NTK) are limited because, under that parametrization, features stay close to their initialization, leaving open how features behave when they evolve substantially during training. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, these networks learn linearly independent features that deviate substantially from their initial values. This rich feature space captures relevant data information and ensures that any point to which training converges is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
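To make the setting concrete, the sketch below illustrates one common $\mu$P recipe for an $L$-layer MLP trained with SGD, written in the "per-layer initialization variance plus per-layer learning-rate scale" form that is equivalent (up to the usual abc-rescaling symmetry) to the parametrization used in the Tensor Programs line of work. This is not the paper's code; the specific scalings, function names, and the tanh activation are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of a muP-parametrized MLP
# trained with plain SGD. Width-dependent choices below (init variances and
# per-layer learning-rate scales) are assumptions following the Tensor
# Programs / muP literature, shown only to make the scaling concrete.

import numpy as np

rng = np.random.default_rng(0)

def init_mup_mlp(d_in, width, d_out, depth):
    """Initialize weights so hidden features move by Theta(1) per SGD step."""
    layers = []
    # Input layer: O(1) entries, learning rate scaled up with width.
    layers.append({"W": rng.normal(0.0, 1.0, size=(width, d_in)),
                   "lr_scale": width})
    # Hidden layers: variance 1/fan_in, width-independent learning rate.
    for _ in range(depth - 2):
        layers.append({"W": rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width)),
                       "lr_scale": 1.0})
    # Output layer: variance 1/fan_in^2, learning rate scaled down with width.
    layers.append({"W": rng.normal(0.0, 1.0 / width, size=(d_out, width)),
                   "lr_scale": 1.0 / width})
    return layers

def forward(layers, x):
    """Return the network output and the post-activation features per layer."""
    feats, h = [], x
    for layer in layers[:-1]:
        h = np.tanh(layer["W"] @ h)   # smooth activation, for illustration
        feats.append(h)
    return layers[-1]["W"] @ h, feats

def sgd_step(layers, x, y, base_lr=0.1):
    """One SGD step on squared loss for a single example (backprop by hand)."""
    out, feats = forward(layers, x)
    delta = out - y                                   # dL/d(out)
    grads = [None] * len(layers)
    grads[-1] = np.outer(delta, feats[-1])
    back = layers[-1]["W"].T @ delta                  # gradient w.r.t. last features
    for i in range(len(layers) - 2, -1, -1):
        pre_act_grad = back * (1.0 - feats[i] ** 2)   # tanh'
        inp = x if i == 0 else feats[i - 1]
        grads[i] = np.outer(pre_act_grad, inp)
        back = layers[i]["W"].T @ pre_act_grad
    for layer, g in zip(layers, grads):
        layer["W"] -= base_lr * layer["lr_scale"] * g

# Usage: under this scaling, first-layer features change appreciably after one
# step even at large width, unlike in the NTK parametrization where feature
# movement vanishes as the width grows.
layers = init_mup_mlp(d_in=8, width=1024, d_out=1, depth=4)
x, y = rng.normal(size=8), np.array([1.0])
_, feats_before = forward(layers, x)
sgd_step(layers, x, y)
_, feats_after = forward(layers, x)
print(np.linalg.norm(feats_after[0] - feats_before[0]) / np.sqrt(1024))
```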
Lay Summary: Artificial intelligence systems called neural networks have achieved remarkable success in tasks like image recognition and language processing. However, scientists still don't fully understand why these systems work so well. A key puzzle is whether neural networks can simultaneously do two important things: learn useful patterns from data (called "feature learning") and find the best possible solution to a problem (called "global optimization").
Previous approaches for deep L-layer networks either allow networks to learn patterns but offer little insight into global optimization, or guarantee finding good solutions but prevent meaningful learning. Our research resolves this puzzle by studying a specific way of setting up neural networks called "Maximal Update Parametrization". We prove mathematically that when networks are made very wide and trained using this approach, they can indeed do both things at once: they learn rich, meaningful patterns from data while also finding globally optimal solutions.
We validate our theory through experiments showing how networks maintain diverse, independent features throughout training. This work provides new theoretical foundations for understanding why certain AI training methods work better than others, potentially informing the design of more effective AI systems.
Primary Area: Deep Learning->Theory
Keywords: Feature learning, Neural networks, $\mu$P, Tensor Program
Submission Number: 12599