Keywords: generalization, optimization
TL;DR: This paper presents empirical evidence supporting the discovery of an indicator of generalization: the evolution across training of the cosine distance between each layer's weight vector and its initialization.
Abstract: Our work presents empirical evidence that layer rotation, i.e. the evolution across training of the cosine distance between each layer's weight vector and its initialization, constitutes an impressively consistent indicator of generalization performance. Compared to previously studied indicators of generalization, we show that layer rotation has the additional benefit of being easily monitored and controlled, as well as having a network-independent optimum: the training procedures during which all layers' weights reach a cosine distance of 1 from their initialization consistently outperform other configurations -by up to 20% test accuracy. Finally, our results also suggest that the study of layer rotation can provide a unified framework to explain the impact of weight decay and adaptive gradient methods on generalization.