Scalable Nested Optimization For Deep Learning

Jonathan Lorraine

Published: 2024, Last Modified: 03 Nov 2025, License: CC BY-SA 4.0

Abstract: Gradient-based optimization has been critical to the success of machine learning, updating a single set of parameters to minimize a single loss. A growing number of applications rely on a generalization of this setup: bilevel or nested optimization, in which different subsets of parameters are updated on different objectives nested inside each other. We focus on the motivating examples of hyperparameter optimization and generative adversarial networks. However, naively applying classical methods often fails when solving these nested problems at a large scale. In this thesis, we build tools for nested optimization that scale to deep learning setups. In Chapter 2, we provide an explicit, differentiable approximation of how neural network weights best-respond to, or optimize for, their loss. We train a hypernetwork that takes in hyperparameters and outputs neural network weights, and we use this hypernetwork for hyperparameter optimization. In Chapter 3, we explore an algorithm that implicitly approximates the neural network weights' best-response using the implicit function theorem. We apply this to hyperparameter optimization, showing that we can tune as many hyperparameters as there are neural network weights. In Chapter 4, we augment simultaneous gradient descent to mitigate rotational dynamics. We do this by generalizing gradient descent with momentum to use a complex-valued momentum while retaining real-valued parameter updates. In Chapter 5, we generalize strategies for finding diverse solutions in single-objective optimization to setups where multiple agents each optimize their own objective. We do this by taking the Ridge-rider algorithm and generalizing its branching criteria to occur at bifurcations, using connections to Lyapunov exponents.
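
To make the Chapter 4 idea concrete, the following is a minimal sketch of a momentum update whose buffer is complex-valued while the parameters stay real. It is an illustration under stated assumptions, not the thesis's implementation: the toy game min_x max_y x*y, the function names, the step size, and the value of beta are all hypothetical choices made for this example, and nothing here is claimed about convergence for these settings.

```python
import numpy as np

def game_gradient(params):
    """Simultaneous gradient for the assumed toy game min_x max_y x*y."""
    x, y = params
    # x descends on x*y (gradient y); y descends on -x*y (gradient -x).
    return np.array([y, -x])

def complex_momentum_step(params, momentum, lr=0.1,
                          beta=0.9 * np.exp(1j * np.pi / 8)):
    """One update with a complex momentum buffer and a real parameter change.

    lr and beta are illustrative values, not tuned settings from the thesis.
    """
    grad = game_gradient(params)
    # The buffer becomes complex whenever beta has a nonzero imaginary part.
    momentum = beta * momentum - grad
    # Only the real part of the scaled buffer moves the parameters.
    params = params + np.real(lr * momentum)
    return params, momentum

params = np.array([1.0, 1.0])
momentum = np.zeros(2, dtype=complex)
for _ in range(100):
    params, momentum = complex_momentum_step(params, momentum)
```

With beta set to 0, the step reduces to plain simultaneous gradient descent, which on this bilinear game rotates around the origin while its norm grows by a factor of sqrt(1 + lr^2) per step; this is the kind of rotational behavior the abstract refers to.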

External IDs: dblp:phd/ca/Lorraine24