On the Convergence of Adam-Type Algorithm for Bilevel Optimization under Unbounded Smoothness

Published: 14 Jun 2026, Last Modified: 14 Jun 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: Adam has become one of the most popular optimizers for training modern deep neural networks, such as transformers. However, its applicability is largely restricted to single-level optimization problems. In this paper, we aim to extend vanilla Adam to tackle bilevel optimization problems, which have important applications in machine learning, such as meta-learning. In particular, we study stochastic bilevel optimization problems where the lower-level function is strongly convex and the upper-level objective is nonconvex with potentially unbounded smoothness. This unbounded smooth objective function covers a broad class of neural networks, including transformers, which may exhibit non-Lipschitz gradients. In this work, we introduce AdamBO, a single-loop Adam-type method that achieves $\widetilde{O}(\epsilon^{-4})$ oracle complexity to find $\epsilon$-stationary points, where the oracle calls involve stochastic gradient or Hessian/Jacobian-vector product evaluations. The key to our analysis is a novel randomness decoupling lemma that provides refined control over the lower-level variable. We conduct extensive experiments on various machine learning tasks involving bilevel formulations with recurrent neural networks (RNNs) and transformers, demonstrating the effectiveness of our proposed Adam-type algorithm.
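For orientation, the problem class described in the abstract is commonly written in the following standard form. This is a sketch in assumed notation: the symbols $\Phi$, $f$, $g$, $x$, $y$, and $y^*(x)$ do not appear in the abstract and may differ from the paper's own notation.

$$
\min_{x \in \mathbb{R}^{d_x}} \; \Phi(x) := f\bigl(x, y^*(x)\bigr)
\quad \text{s.t.} \quad
y^*(x) = \arg\min_{y \in \mathbb{R}^{d_y}} g(x, y)
$$

Here $f$ is the upper-level objective (nonconvex) and $g$ is the lower-level objective (strongly convex in $y$), both accessed through stochastic oracles. When $g$ is strongly convex in $y$ and sufficiently smooth, the implicit function theorem gives the hypergradient

$$
\nabla \Phi(x) = \nabla_x f\bigl(x, y^*(x)\bigr)
- \nabla^2_{xy} g\bigl(x, y^*(x)\bigr)
\bigl[\nabla^2_{yy} g\bigl(x, y^*(x)\bigr)\bigr]^{-1}
\nabla_y f\bigl(x, y^*(x)\bigr),
$$

which is why the oracle model counts Hessian/Jacobian-vector products alongside stochastic gradients. Under this notation, an $\epsilon$-stationary point is typically taken to be a point $x$ with $\mathbb{E}\,\|\nabla \Phi(x)\| \le \epsilon$.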
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Added an ablation study of lower-level update strategies (one-step SGD, multiple-step SGD, and one-step Adam) in Appendix G.
- Added empirical verification of the lower-level approximation error $\|y_t - y_t^*\|$ in the deep AUC experiment in Appendix H.
- Added a sensitivity study of $\texttt{neumann\_lr}$ in Appendix I.
- Added results in Appendix J on how the ratio of the second-order part to the total hypergradient evolves over training epochs.
- Moved the comparison tables and Remark 3 from Appendix F to Section 4.2 of the main paper.
- Added to Section 4.3.3 a paragraph explaining how the stopping-time technique is applied, along with a roadmap paragraph describing how the lemmas connect in proving the main theorem.
- Adjusted the figure legend layouts to improve visual clarity.
- Fixed typos.
Supplementary Material: zip
Assigned Action Editor: ~Ju_Sun1
Submission Number: 7312