Feature Learning and Random Features in Standard Finite-Width Convolutional Neural Networks: An Empirical Study

Maxim Samarin; Volker Roth; David Belius

Published: 20 May 2022, Last Modified: 05 May 2023

Venue: UAI 2022 Poster

Keywords: Neural Tangent Kernel, Feature Learning, Random Feature Models, Lazy Training, Convolutional Neural Networks

TL;DR: Empirical study of training LeNet and AlexNet at different widths with respect to feature learning and random feature models in the context of NTK theory.

Abstract: The Neural Tangent Kernel (NTK) is an important milestone in the ongoing effort to build a theory of deep learning. Its prediction that sufficiently wide neural networks behave as kernel methods, or equivalently as random feature models arising from linearized networks, has been confirmed empirically for certain wide architectures. In this paper, we compare the performance of two common finite-width convolutional neural networks, LeNet and AlexNet, to their linearizations on common benchmark datasets such as MNIST and modified versions of it, CIFAR-10, and an ImageNet subset. We demonstrate empirically that finite-width neural networks generally greatly outperform the finite-width linearizations of these architectures. As the difficulty of the classification task increases, we observe a larger gap, in line with the common intuition that finite-width neural networks perform feature learning, which finite-width linearizations cannot. At the same time, finite-width linearizations improve dramatically with width, approaching the behavior of the wider standard networks, which in turn perform slightly better than their standard-width counterparts. It therefore appears that feature learning is important for non-wide standard networks but becomes less significant with increasing width. We furthermore identify cases where standard and linearized networks match in performance, in agreement with NTK theory, and a case where a wide linearization outperforms its standard-width counterpart.
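To make the notion of a "linearized network" concrete, the following is a minimal sketch of a first-order Taylor expansion of a network in its parameters around the initialization, f_lin(x; θ) = f(x; θ0) + ∇θ f(x; θ0) · (θ − θ0). It is an illustration only and not the authors' code: the toy one-hidden-layer tanh network, the 1/sqrt(width) output scaling, and all function names (`network`, `gradient`, `linearized`) are assumptions chosen for brevity and stand in for LeNet or AlexNet. The gradient at θ0 plays the role of a fixed (random) feature vector, so the linearized model is linear in θ.

```python
import math
import random


def unpack(params, n_in, n_hidden):
    """Split a flat parameter list into hidden weights W and output weights v."""
    W = [params[j * n_in:(j + 1) * n_in] for j in range(n_hidden)]
    v = params[n_hidden * n_in:]
    return W, v


def network(params, x, n_hidden):
    """Toy network (assumed): f(x) = sum_j v_j * tanh(w_j . x) / sqrt(n_hidden)."""
    W, v = unpack(params, len(x), n_hidden)
    pre = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]
    return sum(v_j * math.tanh(p) for v_j, p in zip(v, pre)) / math.sqrt(n_hidden)


def gradient(params, x, n_hidden):
    """Analytic gradient of the output w.r.t. all parameters (same flat order)."""
    W, v = unpack(params, len(x), n_hidden)
    scale = 1.0 / math.sqrt(n_hidden)
    grad_W, grad_v = [], []
    for w, v_j in zip(W, v):
        t = math.tanh(sum(w_i * x_i for w_i, x_i in zip(w, x)))
        # d f / d w_ji = scale * v_j * (1 - tanh^2) * x_i
        grad_W.extend(scale * v_j * (1.0 - t * t) * x_i for x_i in x)
        # d f / d v_j = scale * tanh(w_j . x)
        grad_v.append(scale * t)
    return grad_W + grad_v


def linearized(params, params0, x, n_hidden):
    """First-order Taylor expansion of the network around params0."""
    f0 = network(params0, x, n_hidden)
    g0 = gradient(params0, x, n_hidden)  # fixed features, evaluated at params0
    return f0 + sum(g * (p - p0) for g, p, p0 in zip(g0, params, params0))


# Usage: compare the network and its linearization close to the initialization.
rng = random.Random(0)
n_in, n_hidden = 3, 8
theta0 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden * n_in + n_hidden)]
x = [0.5, -1.0, 2.0]
step = [rng.gauss(0.0, 1.0) for _ in theta0]
theta_near = [p + 1e-3 * s for p, s in zip(theta0, step)]

f_near = network(theta_near, x, n_hidden)
f_lin_near = linearized(theta_near, theta0, x, n_hidden)
print(f_near, f_lin_near)
```

Near θ0 the two outputs agree up to second-order terms in θ − θ0. In this picture, training the linearized model only fits a linear combination of the features fixed at initialization, while training the standard network also changes those features, which is the feature learning the abstract contrasts with random feature models.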

Supplementary Material: zip
