- Abstract: L1 and L2 regularizers are critical tools in machine learning due to their ability to simplify solutions. However, imposing strong L1 or L2 regularization with gradient descent method easily fails, and this limits the generalization ability of the underlying neural networks. To understand this phenomenon, we investigate how and why training fails for strong regularization. Specifically, we examine how gradients change over time for different regularization strengths and provide an analysis why the gradients diminish so fast. We find that there exists a tolerance level of regularization strength, where the learning completely fails if the regularization strength goes beyond it. We propose a simple but novel method, Delayed Strong Regularization, in order to moderate the tolerance level. Experiment results show that our proposed approach indeed achieves strong regularization for both L1 and L2 regularizers and improves both accuracy and sparsity on public data sets. Our source code is published.
- TL;DR: We investigate how and why strong L1/L2 regularization fails and propose a method than can achieve strong regularization.
- Keywords: deep learning, regularization