Keywords: Optimization, Non-convex Optimization, Stochastic gradient descent, adaptive step size, gradient diversity
TL;DR: We propose an adaptive step size for SGD based on gradient diversity.
Abstract: Optimizing machine learning models often requires careful tuning of hyperparameters, especially the learning rate. Traditional approaches either search exhaustively over candidate rates or adopt pre-established ones, and both have drawbacks. The former is computationally intensive, a concern amplified by the trend toward larger models such as large language models (LLMs). The latter risks suboptimal model training. Consequently, there is growing research on adaptive and parameter-free approaches that reduce reliance on manual step size tuning. While adaptive gradient methods like AdaGrad, RMSProp, and Adam aim to adjust learning rates dynamically, they still rely on learning rate parameters that depend on problem-specific characteristics. Our work explores the interplay between step size and gradient dissimilarity, introducing a "Diversity adjusted adaptive step" that adapts to different levels of dissimilarity in sampled gradients within the SGD algorithm. We also provide approximate algorithms to compute this step size efficiently while maintaining performance.
Submission Number: 83