- Abstract: How optimization influences the generalization ability of a DNN is still an active area of research. This work aims to unveil and study a factor of influence: the speed at which each layer trains. In our preliminary work, we develop a visualization technique and an optimization algorithm to monitor and control the layer rotation rate, a tentative measure of layer-level training speed, and show that it has a remarkably consistent and substantial impact on generalization. Our experiments further suggest that weight decay's and adaptive gradients methods' impact on both generalization performance and speed of convergence are solely due to layer rotation rate changes compared to vanilla SGD, offering a novel interpretation of these widely used techniques, and providing supplementary evidence that layer-level training speed indeed impacts generalization. Besides these fundamental findings, we also expect that on a practical level, the tools we introduce will reduce the meta-parameter tuning required to get the best generalization out of a deep network.
- Keywords: generalization, optimization, vanishing gradients, experimental, fundamental research
- TL;DR: This paper provides empirical evidence that 1) the speed at which each layer trains influences generalization and 2) this phenomenon is at the root of weight decay's and adaptive gradient methods' impact on generalization.