Just How Flexible are Neural Networks in Practice?

Ravid Shwartz-Ziv; Micah Goldblum; Arpit Bansal; C. Bayan Bruss; Yann LeCun; Andrew Gordon Wilson

Just How Flexible are Neural Networks in Practice?

Ravid Shwartz-Ziv, Micah Goldblum, Arpit Bansal, C. Bayan Bruss, Yann LeCun, Andrew Gordon Wilson

21 Sept 2023 (modified: 11 Feb 2024)Submitted to ICLR 2024EveryoneRevisionsBibTeX

Primary Area: general machine learning (i.e., none of the above)

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: neural networks, approximation theory, model complexity, generalization

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: The paper investigates the practical flexibility of neural networks, revealing that optimization methods, architecture, and data intricacies significantly impact their capacity to fit data.

Abstract: It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) whereas stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples predicts generalization; (5) ReLU activation functions enable fitting more data despite being designed to avoid vanishing and exploding gradients in deep architectures.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 4234

Loading