Abstract: Nonconvex optimization problems have always been one focus in deep learning, in which many fast adaptive algorithms based on momentum are applied. However, the full gradient computation of high-dimensional feature vector in the above tasks become prohibitive. To reduce the computation cost for optimizers on nonconvex optimization problems typically seen in deep learning, this work proposes a randomized block-coordinate adaptive optimization algorithm, named RAda, which randomly picks a block from the full coordinates of the parameter vector and then sparsely computes its gradient. We prove that RAda converges to a δ<math><mi is="true">δ</mi></math>-accurate solution with the stochastic first-order complexity of O(1/δ2)<math><mrow is="true"><mi is="true">O</mi><mrow is="true"><mo is="true">(</mo><mn is="true">1</mn><mo is="true">/</mo><msup is="true"><mrow is="true"><mi is="true">δ</mi></mrow><mrow is="true"><mn is="true">2</mn></mrow></msup><mo is="true">)</mo></mrow></mrow></math>, where δ<math><mi is="true">δ</mi></math> is the upper bound of the gradient’s square, under nonconvex cases. Experiments on public datasets including CIFAR-10, CIFAR-100, and Penn TreeBank, verify that RAda outperforms the other compared algorithms in terms of the computational cost.
Loading