Deep Learning Over-Parameterization: the Shallow Fallacy

Published: 03 Nov 2023, Last Modified: 23 Dec 2023 · NLDL 2024
Keywords: deep learning, overparametrization, learning theory
TL;DR: Very large deep learning models can be trained with very few examples.
Abstract: A major tenet of conventional wisdom dictates that models should not be over-parameterized: the number of free parameters should not exceed the number of training data points. This tenet originates from centuries of shallow learning, primarily in the form of linear or logistic regression. It is routinely applied to all kinds of data analysis and modeling, and even used to infer properties of the brain. However, through a variety of precise mathematical examples, we show that this conventional wisdom is completely wrong as soon as one moves from shallow to deep learning. In particular, we construct sequences of both linear and non-linear deep learning models whose number of parameters can grow to infinity, while the training set can remain very small (e.g., a single example). In deep models, the parameter space is partitioned into large equivalence classes. Learning can be viewed as a communication process in which information flows from the data to the synaptic weights. The information in the training data only needs to specify an equivalence class of parameters, not the exact parameter values. As such, the number of training examples can be significantly smaller than the number of free parameters.
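
To make the deep linear case concrete, the following is a minimal illustrative sketch (our own assumption for exposition, not the paper's exact construction): a chain of L scalar weights whose prediction depends only on the product of the weights. Every weight vector with the same product belongs to one equivalence class, and a single training example already determines that class, no matter how large L is.

```python
# Illustrative sketch (not the authors' construction): a deep linear chain
# y_hat = (prod_i w_i) * x with L scalar weights. The prediction depends only
# on the product p = prod_i w_i, so all weight vectors sharing that product
# form one equivalence class. A single example (x0, y0) with x0 != 0 already
# specifies the class (p = y0 / x0), regardless of how large L is.
import numpy as np

rng = np.random.default_rng(0)
L = 10_000                      # number of free parameters (chain depth)
x0, y0 = 2.0, 6.0               # one training example; required product is 3.0
target_p = y0 / x0

def predict(w, x):
    return np.prod(w) * x

# Draw several very different weight vectors and rescale each so its product
# equals target_p. Every one of them fits the single example exactly, and they
# all agree on every new input: they lie in the same equivalence class.
for trial in range(3):
    w = np.exp(rng.normal(scale=0.01, size=L))   # random positive weights
    w *= (target_p / np.prod(w)) ** (1.0 / L)    # project onto the class
    print(f"trial {trial}: fit on (x0, y0) -> {predict(w, x0):.4f},"
          f" prediction at x=5 -> {predict(w, 5.0):.4f}")
```

The projection step rescales each weight by the L-th root of the required correction, which leaves all L parameters free to vary individually while fixing the single quantity the one data point actually constrains.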
Submission Number: 16