- Keywords: Optimization for Deep Networks, Non-convex Optimization, Stochastic Optimization
- Abstract: Adaptive gradient approaches that automatically adjust the learning rate on a per-feature basis have been very popular for training deep networks. This rich class of algorithms includes Adagrad, RMSprop, Adam, and recent extensions. All of these algorithms adopt diagonal matrix adaptation because manipulating full matrices in high dimensions is computationally prohibitive. In this paper, we show that block-diagonal matrix adaptation can be a practical and powerful alternative that exploits the structural characteristics of deep learning architectures to significantly improve convergence and out-of-sample generalization. We present AdaBlock, a general framework for block-diagonal matrix adaptation via coordinate grouping, which includes counterparts of the aforementioned algorithms. We prove its convergence in non-convex optimization and provide generalization error bounds, highlighting its benefits compared to diagonal versions. In addition, we propose two techniques that enrich the AdaBlock family: i) an efficient spectrum-clipping scheme that benefits from the superior generalization performance of SGD, and ii) a randomized layer-wise block-diagonal adaptation scheme that further reduces computational cost. Extensive experiments show that AdaBlock achieves state-of-the-art results on several deep learning tasks and can outperform adaptive diagonal methods, vanilla SGD, and a recently proposed modified version of full-matrix adaptation.
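
To make the idea of block-diagonal adaptation via coordinate grouping concrete, the following is a minimal NumPy sketch of an Adagrad-style update in which each coordinate group keeps its own small full-matrix accumulator. It is an illustration under stated assumptions, not the authors' AdaBlock implementation: the function names (`inv_sqrt_psd`, `block_adagrad_step`), the placement of `eps`, and the way groups are supplied as index arrays are all hypothetical, and the spectrum-clipping and randomized layer-wise schemes are not shown.

```python
import numpy as np


def inv_sqrt_psd(mat, eps=1e-6):
    """Approximate inverse square root of a symmetric PSD matrix.

    Uses an eigendecomposition; eps is added to the square-rooted
    eigenvalues for numerical stability (an assumed convention).
    """
    vals, vecs = np.linalg.eigh(mat)
    vals = np.maximum(vals, 0.0)  # guard against tiny negative round-off
    # Scale each eigenvector column by 1 / (sqrt(eigenvalue) + eps).
    return (vecs / (np.sqrt(vals) + eps)) @ vecs.T


def block_adagrad_step(x, grad, accumulators, blocks, lr=0.1, eps=1e-6):
    """One Adagrad-style step with a block-diagonal preconditioner.

    x            : parameter vector, updated in place
    grad         : gradient vector with the same shape as x
    accumulators : list of (b, b) arrays, one per block, updated in place
    blocks       : list of index arrays defining the coordinate groups
    """
    for k, idx in enumerate(blocks):
        g = grad[idx]
        # Accumulate the gradient outer product for this group only.
        accumulators[k] += np.outer(g, g)
        # Precondition the group's gradient with its own block.
        x[idx] -= lr * inv_sqrt_psd(accumulators[k], eps) @ g
    return x
```

With groups of size one, each accumulator is a 1x1 matrix and the step reduces to the familiar diagonal Adagrad update; larger groups additionally capture correlations between the coordinates inside a group, at a cost that grows with the group size rather than with the full parameter dimension.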