Dual Gauss-Newton Directions for Deep Learning

Abstract: Gauss-Newton (a.k.a. prox-linear) directions can be computed by solving an optimization subproblem that trades off a partial linearization of the objective function against a proximity term. In this paper, we study the possibility of leveraging the convexity of this subproblem to solve the corresponding dual instead. As we show, the dual can be advantageous when the number of network outputs is smaller than the number of network parameters. We propose a conjugate gradient algorithm for solving the dual that integrates seamlessly with autodiff through the use of linear operators and handles dual constraints. We prove that this algorithm produces descent directions when run for any number of steps. Finally, we empirically study the advantages and current limitations of our approach compared to various popular deep learning solvers.
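
To make the primal/dual trade-off concrete, the following is a minimal sketch under simplifying assumptions, not the algorithm proposed in the paper. It assumes a squared loss (so the dual is unconstrained), a small explicit Jacobian `J` of shape (outputs, parameters), a residual vector `r`, and a proximity weight `gamma`; all of these names are hypothetical. In that special case the subproblem min_d 0.5*||r + J d||^2 + ||d||^2 / (2*gamma) has a dual whose optimality condition is the linear system (I + gamma * J J^T) alpha = r, whose size is the number of outputs, and the direction is recovered as d = -gamma * J^T alpha. The matrix is only touched through products with `J` and `J.T`, which is where Jacobian-vector and vector-Jacobian products would be used in an autodiff framework.

```python
import numpy as np


def dual_gauss_newton_direction(J, r, gamma, num_steps):
    """Approximate argmin_d 0.5*||r + J d||^2 + ||d||^2/(2*gamma) via its dual.

    Assumes a squared loss, so the dual variable alpha is unconstrained.
    """
    # J is accessed only through these two products (stand-ins for JVP / VJP).
    jvp = lambda v: J @ v
    vjp = lambda u: J.T @ u
    # Linear operator of the dual system: A alpha = (I + gamma * J J^T) alpha.
    matvec = lambda a: a + gamma * jvp(vjp(a))

    # Plain conjugate gradient on A alpha = r, started from alpha = 0.
    alpha = np.zeros_like(r)
    res = r.copy()              # residual r - A alpha
    p = res.copy()              # search direction
    rs = res @ res
    for _ in range(num_steps):
        if rs < 1e-30:          # already converged
            break
        Ap = matvec(p)
        step = rs / (p @ Ap)
        alpha = alpha + step * p
        res = res - step * Ap
        rs_new = res @ res
        p = res + (rs_new / rs) * p
        rs = rs_new

    # Map the dual iterate back to a direction in parameter space.
    return -gamma * vjp(alpha)


rng = np.random.default_rng(0)
m, n = 5, 50                    # m outputs, n parameters (m < n)
J = rng.standard_normal((m, n))
r = rng.standard_normal(m)
gamma = 0.5

d_dual = dual_gauss_newton_direction(J, r, gamma, num_steps=m)
# Reference: solve the n-by-n primal system directly.
d_primal = -np.linalg.solve(J.T @ J + np.eye(n) / gamma, J.T @ r)
grad = J.T @ r                  # gradient of 0.5*||r + J d||^2 at d = 0
```

In this toy setting the dual solve works with an m-by-m operator while the primal reference solves an n-by-n system, which illustrates why fewer outputs than parameters favors the dual. Handling dual constraints, which arise for other losses, is outside the scope of this sketch.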
