- Abstract: Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show that the best-response function for a shallow linear network with L2-regularized Jacobian can be represented by a single network whose hidden units are gated conditionally on the regularizer. Based on this, we propose a gating-based architecture for approximating best-response functions of neural nets more broadly, which can be used as a drop-in replacement for existing deep learning modules. We fit this model online during training using a gradient-based hyperparameter optimization algorithm that alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. Unlike other gradient-based approaches, ours does not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks.
- Keywords: hyperparameter optimization, game theory, optimization
- TL;DR: We use a hypernetwork to predict optimal weights given hyperparameters, and jointly train everything together.
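The alternating scheme in the abstract can be illustrated on a toy problem. The sketch below is a hypothetical, scalar-valued analogue (not the paper's architecture): an affine model `w_hat(l) = a + b*(l - lam)` plays the role of the hypernetwork approximating the best response near the current hyperparameter `lam`, and training alternates between (1) fitting `a, b` on perturbed hyperparameters using only the training loss, and (2) updating `lam` by descending the validation loss through the approximation. The losses, learning rates, and perturbation scale are all invented for this example.

```python
# Toy alternating optimization, loosely in the spirit of Self-Tuning Networks.
# Training loss:   L_T(w, lam) = (w - 1)^2 + lam * w^2  -> exact best response w*(lam) = 1/(1+lam)
# Validation loss: L_V(w)      = (w - 0.5)^2            -> the optimal hyperparameter is lam = 1

a, b = 0.0, 0.0        # affine local approximation to the best-response function
lam = 0.2              # current hyperparameter (L2 penalty strength)
sigma = 0.1            # perturbation scale around the current hyperparameter
lr_w, lr_h = 0.3, 0.05 # learning rates for the approximation and the hyperparameter

for step in range(3000):
    # (1) fit the approximate best response at perturbed hyperparameters,
    #     using only the training loss (no d(train loss)/d(lam) is ever taken)
    for eps in (-sigma, sigma):
        l = lam + eps
        w = a + b * eps                    # predicted near-optimal weight at l
        dL_dw = 2 * (w - 1) + 2 * l * w    # training-loss gradient in w
        a -= lr_w * dL_dw                  # chain rule: dw/da = 1
        b -= lr_w * dL_dw * eps            # chain rule: dw/db = eps
    # (2) update the hyperparameter by descending the validation loss
    #     through the approximate best-response function
    w = a                                  # w_hat at the current lam (eps = 0)
    dV_dlam = 2 * (w - 0.5) * b            # dL_V/dlam = L_V'(w_hat) * dw_hat/dlam
    lam -= lr_h * dV_dlam

print(f"lam = {lam:.3f}, w_hat(lam) = {a:.3f}")  # lam should settle near 1.0
```

Because step (2) only differentiates the *validation* loss through the fitted approximation, the same pattern extends to hyperparameters for which the training loss is not differentiable, which is the property the abstract highlights.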