Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation
TL;DR: How to use knowledge of the optimal loss values to design efficient optimization methods
Abstract: We provide a general convergence theorem for an idealized stochastic Polyak step size called SPS*. Besides convexity, we only assume a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS* as idealized because it requires access to the loss of every training batch evaluated at a solution. It is also ideal in that it achieves the optimal lower bound for globally Lipschitz functions, and it is the first Polyak step size to have an $\mathcal{O}(1/\sqrt{t})$ anytime convergence rate in the smooth setting. We show how to combine SPS* with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments validating our theory, and with a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
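The abstract does not spell out the update rule, but a stochastic Polyak step that uses the per-batch loss evaluated at a solution is, from the description, the quantity SPS* requires. The sketch below is a minimal illustration of that idea, not the paper's exact method: the function name sps_star_step, the eps safeguard, and the toy interpolating least-squares problem (where every batch loss is zero at the solution) are all illustrative assumptions.

import numpy as np

def sps_star_step(x, batch_loss, batch_grad, batch_loss_at_solution, eps=1e-12):
    # One idealized stochastic Polyak-style update:
    #   x <- x - (f_i(x) - f_i(x*)) / ||grad f_i(x)||^2 * grad f_i(x)
    # batch_loss_at_solution plays the role of f_i(x*), the "idealized"
    # information the abstract refers to. eps is an illustrative safeguard.
    g = batch_grad(x)
    step = (batch_loss(x) - batch_loss_at_solution) / (np.dot(g, g) + eps)
    return x - step * g

# Toy usage on an interpolating least-squares problem, where each batch
# loss at the solution x_true is exactly zero.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
x_true = rng.standard_normal(20)
b = A @ x_true

def make_batch(idx):
    Ai, bi = A[idx], b[idx]
    loss = lambda x: 0.5 * np.mean((Ai @ x - bi) ** 2)
    grad = lambda x: Ai.T @ (Ai @ x - bi) / len(idx)
    return loss, grad

x = np.zeros(20)
for t in range(500):
    idx = rng.choice(100, size=10, replace=False)
    loss, grad = make_batch(idx)
    x = sps_star_step(x, loss, grad, batch_loss_at_solution=0.0)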
Primary Area: Optimization
Keywords: Polyak step size, momentum, optimization theory, model distillation
Submission Number: 261