Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

TMLR Paper8899 Authors

12 May 2026 (modified: 22 May 2026), Under review for TMLR, CC BY 4.0
Abstract: We provide a general convergence theorem for an idealized stochastic Polyak step size called \texttt{SPS}*. Besides convexity, we assume only a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to \texttt{SPS}* as idealized because it requires access to the loss of every training batch evaluated at a solution. It is also ideal in that it achieves the optimal lower bound for globally Lipschitz functions and is the first Polyak step size to have an $\mathcal{O}(1/\sqrt{t})$ anytime convergence rate in the smooth setting. We show how to combine \texttt{SPS}* with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments that validate our theory, and with a more practical setting in which we distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
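To make the "idealized" requirement concrete, the following is a minimal sketch of a Polyak-type step that uses the batch loss evaluated at a solution. The abstract does not state the update rule, so the form used here (the non-negative loss gap divided by the squared gradient norm) is an assumption based on the standard Polyak step size. The helper name `sps_star_step`, the noisy least-squares problem, and all constants are hypothetical and chosen only for illustration.

```python
import numpy as np

def sps_star_step(loss_at_x, loss_at_solution, grad):
    """Assumed Polyak-type step: max(f_i(x) - f_i(x*), 0) / ||grad f_i(x)||^2."""
    gap = max(loss_at_x - loss_at_solution, 0.0)
    sq_norm = float(np.dot(grad, grad))
    return 0.0 if sq_norm == 0.0 else gap / sq_norm

# Hypothetical toy problem: noisy least squares with single-sample batches.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
b = A @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

# The "idealized" ingredient: a solution x_star, used only to read off the
# per-sample losses f_i(x_star) that the step size needs.
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def sample_loss(x, i):
    return 0.5 * (A[i] @ x - b[i]) ** 2

def sample_grad(x, i):
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(5)
dists = [float(np.linalg.norm(x - x_star))]
for _ in range(2000):
    i = int(rng.integers(50))
    g = sample_grad(x, i)
    step = sps_star_step(sample_loss(x, i), sample_loss(x_star, i), g)
    x = x - step * g
    dists.append(float(np.linalg.norm(x - x_star)))
```

For convex per-sample losses, this choice of step never increases the distance to `x_star`, which is why `dists` can be checked for monotonicity. In practice the per-batch losses at a solution are rarely available, which is what makes the method idealized.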
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Bruno_Loureiro1
Submission Number: 8899