Abstract: Bi-level optimization, especially the gradient-based category, has been widely used in the deep learning community, including for hyperparameter optimization and meta knowledge extraction. Bi-level optimization embeds one problem within another, and gradient-based methods solve the outer-level task by computing the hypergradient, which is much more efficient than classical methods such as evolutionary algorithms. In this survey, we first give a formal definition of gradient-based bi-level optimization. Second, we illustrate how to formulate a research problem as a bi-level optimization problem, which is of great practical use for beginners. More specifically, there are two formulations: the single-task formulation, which optimizes hyperparameters such as regularization parameters and distilled data, and the multi-task formulation, which extracts meta knowledge such as the model initialization. Given a bi-level formulation, we then discuss four bi-level optimization solvers for updating the outer variable: explicit gradient update, proxy update, implicit function update, and closed-form update. Finally, we conclude the survey by pointing out the great potential of gradient-based bi-level optimization for science problems (AI4Science).
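For readers who want the formulation pinned down, a minimal sketch in our own notation (the symbols $\lambda$, $\theta$, $F$, and $f$ are illustrative choices, not necessarily the survey's): the outer variable $\lambda$ (e.g., a hyperparameter) is optimized subject to the inner variable $\theta$ (e.g., model weights) solving its own problem,

$$\min_{\lambda}\; F\big(\lambda, \theta^{*}(\lambda)\big) \quad \text{s.t.} \quad \theta^{*}(\lambda) \in \arg\min_{\theta}\; f(\theta, \lambda),$$

where $F$ is the outer objective (e.g., validation loss) and $f$ is the inner objective (e.g., training loss). The hypergradient is then

$$\nabla_{\lambda} F = \frac{\partial F}{\partial \lambda} + \Big(\frac{\partial \theta^{*}}{\partial \lambda}\Big)^{\top} \frac{\partial F}{\partial \theta^{*}},$$

and the four solver families differ chiefly in how they approximate $\partial \theta^{*} / \partial \lambda$.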
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: In response to the reviewer's feedback, we have clarified and expanded upon key sections of the paper to address the perceived lack of coherence and connection between the diverse topics discussed.
1. We have provided further explanation of why the explicit gradient update method is a popular choice in the deep learning community for hyperparameter optimization, highlighting its historical prominence, intuitive appeal, and accessibility through dedicated deep learning libraries.
2. We have also addressed the usage of different optimization methods across subfields. For instance, while the explicit gradient update method is frequently employed for hyperparameter optimization due to its computational efficiency (a minimal sketch of this update appears after this list), the more demanding closed-form kernel method is generally preferred for data optimization tasks that prioritize precision.
3. We have clarified our criteria for classifying a problem as single-task, explaining why certain problems require a decomposition into two levels.
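To make point 2 concrete, here is a minimal, self-contained sketch of the explicit gradient update (unrolled differentiation) in PyTorch; the names (`train_loss`, `val_loss`, `lam`) and the one-step unrolling are illustrative assumptions, not code from the survey:

```python
# Minimal sketch of the explicit gradient update (unrolled differentiation).
# All names and the one-step unrolling are illustrative assumptions.
import torch

torch.manual_seed(0)
x_tr, y_tr = torch.randn(32, 10), torch.randn(32)    # inner (training) data
x_val, y_val = torch.randn(32, 10), torch.randn(32)  # outer (validation) data
w = torch.randn(10, requires_grad=True)              # inner variable: model weights
lam = torch.zeros((), requires_grad=True)            # outer variable: log regularization strength
inner_lr, outer_lr = 0.1, 0.01

def train_loss(w, lam):
    # Inner objective: fit the training data, with a lam-weighted L2 penalty.
    return ((x_tr @ w - y_tr) ** 2).mean() + lam.exp() * (w ** 2).sum()

def val_loss(w):
    # Outer objective: validation error, which depends on lam only through w.
    return ((x_val @ w - y_val) ** 2).mean()

# One inner SGD step, kept differentiable w.r.t. lam via create_graph=True.
g = torch.autograd.grad(train_loss(w, lam), w, create_graph=True)[0]
w_unrolled = w - inner_lr * g

# Hypergradient: differentiate the outer loss through the unrolled inner step.
hypergrad = torch.autograd.grad(val_loss(w_unrolled), lam)[0]

# Explicit gradient update on the outer variable.
with torch.no_grad():
    lam -= outer_lr * hypergrad
```

By contrast, the closed-form kernel method referenced in point 2 replaces the unrolled inner loop with an exact inner solution, which is why it is costlier but more precise for data optimization.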
Assigned Action Editor: ~Seungjin_Choi1
Submission Number: 1065