- Abstract: Training methods for deep networks are primarily variants on stochastic gradient descent. Techniques that use (approximate) second-order information are rarely used because of the computational cost and noise associated with those approaches in deep learning contexts. However, in this paper, we show how feedforward deep networks exhibit a low-rank derivative structure. This low-rank structure makes it possible to use second-order information without needing approximations and without incurring a significantly greater computational cost than gradient descent. To demonstrate this capability, we implement Cubic Regularization (CR) on a feedforward deep network with stochastic gradient descent and two of its variants. There, we use CR to calculate learning rates on a per-iteration basis while training on the MNIST and CIFAR-10 datasets. CR proved particularly successful in escaping plateau regions of the objective function. We also found that this approach requires less problem-specific information (e.g. an optimal initial learning rate) than other first-order methods in order to perform well.
- TL;DR: We show that deep learning network derivatives have a low-rank structure, and this structure allows us to use second-order derivative information to calculate learning rates adaptively and in a computationally feasible manner.
- Keywords: Deep Learning, Derivative Calculations, Optimization Algorithms