On the Implicit Biases of Architecture & Gradient Descent

Jeremy Bernstein; Yisong Yue

Published: 28 Jan 2022, Last Modified: 22 Jun 2025

Venue: ICLR 2022 Submitted

Readers: Everyone

Keywords: generalisation, function space, PAC-Bayes, NNGP, orthants, margin

Abstract: Do neural networks generalise because of bias in the functions returned by gradient descent, or bias already present in the network architecture? $\textit{¿Por qué no los dos?}$ This paper finds that while typical networks that fit the training data already generalise fairly well, gradient descent can further improve generalisation by selecting networks with a large margin. This conclusion is based on a careful study of the behaviour of infinite width networks trained by Bayesian inference and finite width networks trained by gradient descent. To measure the implicit bias of architecture, new technical tools are developed to both $\textit{analytically bound}$ and $\textit{consistently estimate}$ the average test error of the neural network–Gaussian process (NNGP) posterior. This error is found to be already better than chance, corroborating the findings of Valle-Pérez et al. (2019) and underscoring the importance of architecture. Going beyond this result, this paper finds that test performance can be substantially improved by selecting a function with much larger margin than is typical under the NNGP posterior. This highlights a curious fact: $\textit{minimum a posteriori}$ functions can generalise best, and gradient descent can select for those functions. In summary, new technical tools suggest a nuanced portrait of generalisation involving both the implicit biases of architecture and gradient descent.

One-sentence Summary: New technical tools suggest a nuanced portrait of generalisation that involves both the implicit biases of architecture and gradient descent.

Supplementary Material: zip
