Keywords: Rank minimization, weight decay, deep learning
TL;DR: We find that neural network parameters tend toward low rank and alignment, and weight decay promotes this effect.
Abstract: We empirically study the evolution of the singular values and vectors of neural network weights across a wide variety of practical architectures and domains, including CNNs for image classification, LSTMs for speech recognition, and Transformers for language modeling. Across these settings, we observe that (i) large singular values grow much faster than small ones, decreasing the effective ranks of weight matrices, (ii) this growth occurs despite only weak alignment between neighboring layers' singular vectors, in contrast to the strong alignment commonly assumed in prior theoretical work, and (iii) weight decay promotes both rank minimization and neighboring-layer alignment. Since these architectures are far from idealized linear neural networks, our observations extend the predictions of existing theory to more practical settings.
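The abstract's central quantity, the effective rank of a weight matrix, can be tracked from the singular value spectrum. The following is a minimal sketch, not the authors' code, assuming the common spectral-entropy definition of effective rank (exponential of the Shannon entropy of the normalized singular values); the function and example names are illustrative.

```python
# Sketch: tracking the effective rank of a weight matrix via its SVD.
# Assumes the spectral-entropy definition of effective rank, which may
# differ from the exact measure used in the paper.
import numpy as np

def effective_rank(weight: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a 2-D weight matrix from its singular value spectrum."""
    s = np.linalg.svd(weight, compute_uv=False)   # singular values, descending
    p = s / (s.sum() + eps)                       # normalize into a distribution
    entropy = -np.sum(p * np.log(p + eps))        # Shannon entropy of the spectrum
    return float(np.exp(entropy))                 # exp(entropy) = effective rank

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256))
    print(effective_rank(W))                      # near full rank for a random matrix

    # Concentrating mass in a few large singular values lowers the effective rank,
    # mirroring the low-rank tendency described in the abstract.
    U, s, Vt = np.linalg.svd(W)
    s[8:] *= 0.01                                 # shrink all but the top-8 values
    print(effective_rank(U @ np.diag(s) @ Vt))    # much lower effective rank
```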
Student Paper: Yes
Submission Number: 56