Gradient descent in the presence of extreme flatness and steepness
Keywords: Gradient descent, Newton's method, learning rate, non-convex optimization, non-smooth optimization
TL;DR: We show that setting a good learning rate is challenging even for simple non-smooth, non-convex functions; however, a modified gradient descent with second-order regularization seems promising.
Abstract: Typical theoretical analyses of the convergence of gradient descent require assumptions such as convexity and smoothness that do not hold in practice. Towards understanding the challenges and potential solutions for learning with non-convex, non-smooth functions, we study the convergence of gradient descent on a simple sigmoid-based function family. The functions in this family simultaneously exhibit extreme flatness and extreme sharpness, making it particularly challenging to choose a step size. We show that both small and large step sizes fail; in fact, convergence is a highly volatile function of initialization and learning rate. We observe similar challenges with a known regularized version of Newton's method. We propose a novel Newton-damped gradient descent that performs well on the non-convex, non-smooth family under study, in the sense that most settings of the learning rate lead to convergence. Our small-scale experiments indicate interesting directions for future empirical and theoretical research.
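To make the setting concrete, here is a minimal, self-contained sketch of the kind of objective the abstract describes and of one plausible curvature-damped update. The specific function family, the constant `a = 25`, the initializations, and the damping rule `grad / (1 + |hess|)` are illustrative assumptions, not taken from the paper; the actual experiments are in the linked notebook.

```python
import numpy as np

def sigma(z):
    """Numerically stable logistic function."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def f(x, a=25.0):
    """Sum of two steep sigmoids: flat plateaus for |x| >> 1,
    a flat basin near 0, and sharp walls near x = +/-1 (illustrative only)."""
    return sigma(a * (x - 1.0)) + sigma(-a * (x + 1.0))

def grad(x, a=25.0):
    s1 = sigma(a * (x - 1.0))
    s2 = sigma(-a * (x + 1.0))
    return a * s1 * (1.0 - s1) - a * s2 * (1.0 - s2)

def hess(x, a=25.0):
    s1 = sigma(a * (x - 1.0))
    s2 = sigma(-a * (x + 1.0))
    return (a**2 * s1 * (1.0 - s1) * (1.0 - 2.0 * s1)
            + a**2 * s2 * (1.0 - s2) * (1.0 - 2.0 * s2))

def gradient_descent(x0, lr, steps=2000):
    # Plain gradient descent: tiny gradients on the plateaus stall progress,
    # while the sharp walls make large steps overshoot the basin entirely.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def newton_damped_gd(x0, lr, steps=2000):
    # Curvature-damped update: the step is shrunk wherever the second
    # derivative is large. This is a guess at what "Newton damping" might
    # mean here, not necessarily the update analyzed in the paper.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x) / (1.0 + abs(hess(x)))
    return x

if __name__ == "__main__":
    # Two illustrative starts: on a flat plateau (3.0) and near a steep wall (1.1).
    for x0 in (3.0, 1.1):
        for lr in (1e-3, 1e-1, 1.0, 10.0):
            xg = gradient_descent(x0, lr)
            xn = newton_damped_gd(x0, lr)
            print(f"x0={x0:>4}  lr={lr:<6g}  "
                  f"gd: x={xg: .3f}, f={f(xg):.3f}   "
                  f"damped: x={xn: .3f}, f={f(xn):.3f}")
```

In this toy run, the plateau initialization makes essentially no progress for any step size because the gradient is numerically zero, while near the wall a moderate step size reaches the basin but a large one is thrown onto the opposite plateau; the damped variant avoids the overshoot across the learning rates tried, though it still cannot create gradient where there is none.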
Code: ipynb
Submission Number: 85