- TL;DR: We investigate which properties of an objective function favor sign gradient descent.
- Abstract: Sign gradient descent has become popular in machine learning due to its favorable communication cost in distributed optimization and its good performance in neural network training. However, we currently do not have a good understanding of which geometrical properties of the objective function determine the relative speed of sign gradient descent compared to standard gradient descent. In this work, we frame sign gradient descent as steepest descent with respect to the maximum norm. We review the steepest descent framework and the related concept of smoothness with respect to arbitrary norms. By studying the smoothness constant resulting from the $L^\infty$-geometry, we isolate properties of the objective which favor sign gradient descent relative to gradient descent. In short, we find two requirements on its Hessian: (i) some degree of ``diagonal dominance'' and (ii) the maximal eigenvalue being much larger than the average eigenvalue. We also clarify the meaning of a certain separable smoothness assumption used in previous analyses of sign gradient descent. Experiments verify the developed theory.
- Keywords: Sign gradient descent, signSGD, steepest descent, Adam