Keywords: Feature Learning, Symmetry Learning, Theory of Deep Learning, Weight Decay
TL;DR: We describe the low-dimensional features learned by large depth DNNs, using a Taylor expansion of the representation cost around infinite depth.
Abstract: Previous work has shown that DNNs with
large depth $L$ and $L_{2}$-regularization are biased towards learning
low-dimensional representations of the inputs, which can be interpreted
as minimizing a notion of rank $R^{(0)}(f)$ of the learned function
$f$, conjectured to be the Bottleneck rank. We compute finite depth
corrections to this result, revealing a measure $R^{(1)}$ of regularity
which bounds the pseudo-determinant of the Jacobian $\left|Jf(x)\right|_{+}$
and is subadditive under composition and addition. This formalizes
a balance between learning low-dimensional representations and minimizing
complexity/irregularity in the feature maps, allowing the network
to learn the 'right' inner dimension. Finally, we prove the conjectured
bottleneck structure in the learned features as $L\to\infty$: for
large depths, almost all hidden representations are approximately
$R^{(0)}(f)$-dimensional, and almost all weight matrices $W_{\ell}$
have $R^{(0)}(f)$ singular values close to 1 while the others are
$O(L^{-\frac{1}{2}})$. Interestingly, the use of large learning rates
is required to guarantee an order $O(L)$ NTK, which in turn guarantees
infinite depth convergence of the representations of almost all layers.
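For orientation, the objects named in the abstract can be sketched schematically. The notation here is an assumption and is not fixed by the abstract: $R(f;L)$ denotes the representation cost of $f$ at depth $L$, $R^{(2)}$ the next-order coefficient, and $s_{1},\dots,s_{k}$ the non-zero singular values of $Jf(x)$. A Taylor expansion around infinite depth would then plausibly read

$$
R(f;L) = L\,R^{(0)}(f) + R^{(1)}(f) + \frac{1}{L}R^{(2)}(f) + O(L^{-2}),
$$

so that $R^{(0)}$ dominates at large depth and $R^{(1)}$ is the first finite-depth correction. The pseudo-determinant is the product of the non-zero singular values of the Jacobian:

$$
\left|Jf(x)\right|_{+} = \prod_{i=1}^{k} s_{i}, \qquad k = \mathrm{Rank}\,Jf(x).
$$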
Supplementary Material: zip
Submission Number: 13040