Keywords: optimization
TL;DR: Anon is a provably stable optimizer with tunable adaptivity, unifying SGD and Adam while outperforming both.
Abstract: Adaptive optimizers such as Adam have achieved great success in training large-scale models such as large language models and diffusion models. However, they often generalize worse than non-adaptive methods such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: the restricted adaptivity of the pre-conditioner, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose **Anon** (**A**daptivity **N**on-restricted **O**ptimizer with **N**ovel convergence technique), a novel optimizer with **continuously tunable adaptivity** $\gamma \in \mathbb{R}$, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce the *incremental delay update (IDU)*, a mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees in both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable, tunable design principle, and that Anon provides the first unified and reliable framework that bridges the gap between classical and modern optimizers while surpassing the strengths of both.
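The tunable-adaptivity idea can be illustrated with a minimal sketch, under the assumption that $\gamma$ acts as the exponent of the second-moment pre-conditioner, so that $\gamma = 0$ recovers SGD with momentum and $\gamma = 0.5$ recovers an Adam-like update. This is an illustrative assumption only: the actual Anon update rule and its IDU mechanism are specified in the paper and are not reproduced here, and the function name and hyperparameter defaults below are hypothetical.

```python
import numpy as np

def anon_like_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                   gamma=0.5, eps=1e-8):
    """One step of a gamma-parameterized adaptive update (illustrative only).

    gamma = 0.0: no pre-conditioning, i.e. SGD with momentum;
    gamma = 0.5: Adam-like 1/sqrt(v) pre-conditioning;
    other values interpolate or extrapolate between the two.
    This is NOT the Anon update from the paper; IDU is omitted.
    """
    m = state.get("m", np.zeros_like(param))
    v = state.get("v", np.zeros_like(param))
    t = state.get("t", 0) + 1

    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment

    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)

    # Pre-conditioner with tunable adaptivity: v_hat ** gamma.
    param = param - lr * m_hat / (v_hat ** gamma + eps)

    state.update(m=m, v=v, t=t)
    return param, state
```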
Primary Area: optimization
Submission Number: 13105