An Optimization Principle Of Deep Learning?Download PDF

25 Sep 2019 (modified: 24 Dec 2019)ICLR 2020 Conference Blind SubmissionReaders: Everyone
  • Original Pdf: pdf
  • Abstract: Training deep neural networks (DNNs) has achieved great success in recent years. Modern DNN trainings utilize various types of training techniques that are developed in different aspects, e.g., activation functions for neurons, batch normalization for hidden layers, skip connections for network architecture and stochastic algorithms for optimization. Despite the effectiveness of these techniques, it is still mysterious how they help accelerate DNN trainings in practice. In this paper, we propose an optimization principle that is parameterized by $\gamma>0$ for stochastic algorithms in nonconvex and over-parameterized optimization. The principle guarantees the convergence of stochastic algorithms to a global minimum with a monotonically diminishing parameter distance to the minimizer and leads to a $\mathcal{O}(1/\gamma K)$ sub-linear convergence rate, where $K$ is the number of iterations. Through extensive experiments, we show that DNN trainings consistently obey the $\gamma$-optimization principle and its theoretical implications. In particular, we observe that the trainings that apply the training techniques achieve accelerated convergence and obey the principle with a large $\gamma$, which is consistent with the $\mathcal{O}(1/\gamma K)$ convergence rate result under the optimization principle. We think the $\gamma$-optimization principle captures and quantifies the impacts of various DNN training techniques and can be of independent interest from a theoretical perspective.
8 Replies