Keywords: Gradient descent, learning rate, sample complexity, non-convex non-smooth optimization
TL;DR: We establish sample complexity bounds for tuning the learning rate and momentum parameters in gradient descent for non-convex and non-smooth functions.
Abstract: Gradient-based iterative optimization methods are the workhorse of modern machine learning. They crucially rely on careful tuning of parameters like the learning rate, yet in practice these parameters are typically set with heuristic approaches that lack formal near-optimality guarantees. Recent work by Gupta and Roughgarden studies how to learn a good step-size in gradient descent. However, like most of the literature with theoretical guarantees for gradient-based optimization, their theoretical results rely on strong assumptions on the function class, including convexity and smoothness, which do not hold in typical applications. In this work, we develop novel analytical tools for provably tuning the step-size in gradient-based algorithms that apply to non-convex and non-smooth functions. We obtain sample complexity bounds for learning the step-size in gradient descent that match those shown in prior work for smooth, convex functions (up to logarithmic factors), but for a much broader class of functions. Our analysis applies to gradient descent for neural networks with piecewise-polynomial activation functions (including the ReLU activation). Furthermore, we show the versatility of our framework by applying it to tuning momentum and step-size simultaneously.
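As a rough illustration of the data-driven tuning setup the abstract describes (this is a hedged sketch, not the paper's algorithm or analysis): sample problem instances, run gradient descent with each candidate (step-size, momentum) pair, and select the pair with the best average final loss. The tiny ReLU-network objective and all constants below are illustrative assumptions.

```python
# Minimal sketch of data-driven tuning of step-size and momentum.
# The objective, grid, and constants are illustrative assumptions, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def sample_instance(n=50, d=5, h=8):
    """Sample one problem instance: data (X, y) plus initial network weights."""
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    W1 = 0.5 * rng.normal(size=(d, h))
    w2 = 0.5 * rng.normal(size=h)
    return X, y, W1, w2

def loss_and_grads(X, y, W1, w2):
    """Squared loss of a one-hidden-layer ReLU network and its gradients."""
    Z = X @ W1                      # pre-activations
    A = np.maximum(Z, 0.0)          # ReLU: piecewise-linear, non-smooth
    err = A @ w2 - y
    loss = 0.5 * np.mean(err ** 2)
    g_pred = err / len(y)
    g_w2 = A.T @ g_pred
    g_Z = np.outer(g_pred, w2) * (Z > 0)
    g_W1 = X.T @ g_Z
    return loss, g_W1, g_w2

def run_gd(instance, step, momentum, iters=100):
    """Heavy-ball gradient descent; returns the final loss on this instance."""
    X, y, W1, w2 = instance
    W1, w2 = W1.copy(), w2.copy()
    v_W1, v_w2 = np.zeros_like(W1), np.zeros_like(w2)
    for _ in range(iters):
        _, g_W1, g_w2 = loss_and_grads(X, y, W1, w2)
        v_W1 = momentum * v_W1 - step * g_W1
        v_w2 = momentum * v_w2 - step * g_w2
        W1 += v_W1
        w2 += v_w2
    return loss_and_grads(X, y, W1, w2)[0]

# Pick the (step-size, momentum) pair minimizing average final loss over samples.
instances = [sample_instance() for _ in range(20)]
steps = [0.001, 0.01, 0.05, 0.1]
momenta = [0.0, 0.5, 0.9]
best = min(
    ((s, m) for s in steps for m in momenta),
    key=lambda sm: np.mean([run_gd(inst, *sm) for inst in instances]),
)
print("selected (step-size, momentum):", best)
```

The paper's contribution, as stated in the abstract, is to bound how many sampled instances suffice for such empirically selected parameters to be provably near-optimal, even when the underlying functions are non-convex and non-smooth.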
Student Paper: No
Submission Number: 79