Stochastic gradient updates yield deep equilibrium kernels
Authors that are also TMLR Expert Reviewers: ~Russell_Tsuchida1
Abstract: Implicit deep learning allows one to compute with implicitly defined features, for example features that solve optimisation problems. We consider the problem of computing with implicitly defined features in a kernel regime. We call such a kernel a deep equilibrium kernel (DEKer). Specialising to a stochastic gradient descent (SGD) update rule applied to features (not weights) in a latent variable model, we find an exact deterministic update rule for the DEKer in a high-dimensional limit. This derived update rule resembles previously introduced infinitely wide neural network kernels. To perform our analysis, we describe an alternative parameterisation of the link function of exponential families, a result that may be of independent interest. This new parameterisation allows us to draw new connections between a statistician's inverse link function and a machine learner's activation function. We describe an interesting property of SGD in this high-dimensional limit: even though individual iterates are random vectors, inner products of any two iterates are deterministic, and can converge to a unique fixed point as the number of iterates increases. We find that the DEKer empirically outperforms related neural network kernels on a series of benchmarks.
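The abstract's observation that inner products of random high-dimensional features become deterministic can be illustrated with a much simpler, well-known case. The sketch below is not the DEKer update rule from the paper: it uses single-layer random ReLU features, whose normalised inner product concentrates on the degree-1 arc-cosine kernel as the width grows. The function names, test vectors, and widths are assumptions chosen for illustration.

```python
import numpy as np


def relu_feature_kernel(x, y, width, rng):
    """Finite-width estimate: mean over `width` random ReLU features."""
    # Each row of W is an independent standard Gaussian weight vector.
    W = rng.standard_normal((width, x.shape[0]))
    fx = np.maximum(W @ x, 0.0)
    fy = np.maximum(W @ y, 0.0)
    return fx @ fy / width


def arc_cosine_limit(x, y):
    """Infinite-width limit E[relu(w.x) relu(w.y)] for w ~ N(0, I)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return nx * ny * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)


rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 0.0])  # illustrative inputs, not from the paper
y = np.array([0.6, 0.8, 0.0])

limit = arc_cosine_limit(x, y)
# The features themselves are random, but their normalised inner product
# fluctuates less and less around `limit` as the width increases.
estimates = {w: relu_feature_kernel(x, y, w, rng) for w in (100, 10_000, 1_000_000)}
for width, value in estimates.items():
    print(f"width={width:>9d}  estimate={value:.4f}  |error|={abs(value - limit):.4f}")
```

The random error of the estimate shrinks roughly like one over the square root of the width, which is the same kind of concentration that makes a deterministic kernel update possible in a high-dimensional limit.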
Certifications: Expert Certification
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: In summary, roughly 15 paragraphs of new or significantly modified text have been introduced into the revision. These new paragraphs target isolated reviewer concerns, as well as broader concerns that were raised regarding the presentation. The changes we deem to be most important are as follows.
- We now include a proof sketch of Theorem 4 in the body of the paper. This outlines the rough idea of the proof given in the appendix.
- We add a new paragraph immediately before section 1.1 which intuitively describes our contributions, before they are introduced slightly more formally in section 1.1.
- We make some targeted changes to the text preceding equation (1), which describe the update rule as an implicit mapping from solutions at time step $t$ to solutions at time step $t+1$. This is followed by emphasis that we do not consider weight updates, but rather feature updates, citing Jacot et al. and the NTK for weight updates.
- We describe why $2 \times 2$ matrices suffice just before equations (3) and (4). We repeat this in the experiments section.
- We replace our informal theorem statement with a text description just before the beginning of section 2.
- We reorder the presentation so that assumption 2(b) immediately follows assumption 2(a).
- We add some targeted sentences to highlight the difference from the NNK and NTK.
- Theorem 7 is now followed by a description of what the theorem says intuitively about the probability that the distance between the true DEKer and its finite-width approximation is large.
- We add an extra paragraph to the conclusion on the more general setting of update rules and implicit kernels. We also cite concurrent work on differential equations and Euler's method.
Supplementary Material: pdf
Assigned Action Editor: ~Nadav_Cohen1
Submission Number: 820