Joint Descent: Training and Tuning Simultaneously

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Withdrawn Submission
Keywords: First Order Optimization, Zeroth Order Optimization
Abstract: Typically in machine learning, training and tuning are done in an alternating manner: for a fixed set of hyperparameters $y$, we apply gradient descent to our objective $f(x, y)$ over the trainable variables $x$ until convergence; then, we apply a tuning step over $y$ to find another promising setting of hyperparameters. Because the full training cycle is completed before a tuning step is applied, the optimization procedure greatly emphasizes the gradient step, which seems justified since first-order methods provide a faster convergence rate. In this paper, we argue that placing equal emphasis on training and tuning leads to faster convergence both theoretically and empirically. We present Joint Descent (JD) and a novel theoretical analysis of acceleration via an unbiased gradient estimate, giving an optimal iteration complexity of $O(\sqrt{\kappa}\, n_y \log(n/\epsilon))$, where $\kappa$ is the condition number and $n_y$ is the dimension of $y$. This provably improves upon the naive classical bound and implies that we essentially train for free if we place equal emphasis on training and tuning steps. Empirically, we observe that an unbiased gradient estimate achieves the best convergence results, supporting our theory.
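To make the idea concrete, below is a minimal sketch (not the paper's exact algorithm) of a joint-descent-style loop: a first-order gradient step on the trainable variables x is interleaved with a zeroth-order step on the hyperparameters y, here using a two-point random-direction estimate as a stand-in for the paper's unbiased gradient estimator. The toy objective, step sizes, and function names are illustrative assumptions.

import numpy as np

def joint_descent(f, grad_x, x0, y0, lr_x=0.1, lr_y=0.05, mu=1e-3, iters=500):
    # Alternate one gradient step on x with one zeroth-order step on y.
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    n_y = y.size
    for _ in range(iters):
        # First-order (training) step on the differentiable variables x.
        x = x - lr_x * grad_x(x, y)
        # Zeroth-order (tuning) step on y: two-point finite-difference
        # estimate of the y-gradient along a random Gaussian direction u.
        u = np.random.randn(n_y)
        g_y = (f(x, y + mu * u) - f(x, y - mu * u)) / (2 * mu) * u
        y = y - lr_y * g_y
    return x, y

# Toy usage on a coupled strongly convex quadratic (illustrative only).
f = lambda x, y: 0.5 * np.sum(x**2) + 0.5 * np.sum((y - 1.0)**2) + 0.1 * np.sum(x) * np.sum(y)
grad_x = lambda x, y: x + 0.1 * np.sum(y) * np.ones_like(x)
x_opt, y_opt = joint_descent(f, grad_x, x0=np.ones(10), y0=np.zeros(3))
print("final objective:", f(x_opt, y_opt))

The point of the interleaving is that y is updated after every x step rather than after a full training run, which is the equal emphasis on training and tuning that the abstract argues for.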
One-sentence Summary: This paper shows that faster optimization can be achieved by tuning hyperparameters alongside training the differentiable variables.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=hCosro4bEB