Keywords: Low-Rank, SGD, Implicit Bias, Rank, Rank Minimization, Weight Decay
Abstract: We explore the implicit bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices during the training of deep neural networks. Through theoretical analysis and empirical validation, we demonstrate that this rank-minimizing bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay. Unlike previous studies, our analysis does not rely on restrictive assumptions about the data, convergence, optimality of the learned weight matrices, or the network architecture, making it applicable to a wide range of neural networks of any width or depth. We further show that weight decay is essential for inducing this low-rank bias. Finally, we empirically explore the connection between this bias and generalization, finding that it has a noticeable yet marginal effect on test performance.
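The sketch below is not the authors' code; it is a minimal, hypothetical illustration of how the claimed effect could be probed: train a small MLP with SGD plus weight decay and track the effective rank of a hidden weight matrix over training. The architecture, hyperparameters, and effective-rank threshold are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: probe the low-rank bias of SGD + weight decay
# by tracking the effective rank of a weight matrix during training.
import torch
import torch.nn as nn

def effective_rank(W, tol=1e-3):
    """Count singular values above tol * (largest singular value)."""
    s = torch.linalg.svdvals(W.detach())
    return int((s > tol * s[0]).sum())

torch.manual_seed(0)
X = torch.randn(2048, 64)   # synthetic inputs (illustrative)
y = torch.randn(2048, 10)   # synthetic regression targets (illustrative)

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))
# Per the paper's claim, smaller batches, higher learning rates, and stronger
# weight decay should all strengthen the low-rank bias; values here are guesses.
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-3)
loss_fn = nn.MSELoss()
batch_size = 16

for epoch in range(200):
    perm = torch.randperm(X.size(0))
    for i in range(0, X.size(0), batch_size):
        idx = perm[i:i + batch_size]
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()
        opt.step()
    if epoch % 50 == 0:
        W = model[0].weight  # first-layer weight matrix (256 x 64)
        print(f"epoch {epoch}: effective rank of first layer = {effective_rank(W)}")
```

Re-running this sketch while varying `batch_size`, `lr`, or `weight_decay` (or setting `weight_decay=0`) is one simple way to check the trends the abstract describes.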
Submission Number: 47