Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

Chulhee Yun, Suvrit Sra, Ali Jadbabaie

06 Sept 2019 (modified: 05 May 2023), NeurIPS 2019
Abstract: We study finite sample expressivity, i.e., the memorization power of ReLU networks. We show that 3-layer ReLU networks with $\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points. We also prove that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, thus establishing tight bounds on memorization capacity. For deeper networks, we show that an $L$-layer network with $W$ parameters in the hidden layers can memorize $N$ data points if $W = \Omega(N)$. Combined with a recent upper bound of $O(WL\log W)$ on VC dimension, our construction is nearly tight for any fixed $L$. Subsequently, we analyze the memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of $N$ hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with small empirical risk.
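
As a rough empirical companion to the width-$\Theta(\sqrt{N})$ claim, the sketch below (not the paper's construction; the dataset size, input dimension, width multiplier, optimizer, and training budget are all illustrative assumptions) trains a 3-layer ReLU network whose hidden width scales like $\sqrt{N}$ to fit $N$ randomly labeled points, then reports how many points it memorizes.

```python
# A minimal sketch, assuming random Gaussian inputs with arbitrary binary
# labels and a hidden width on the order of sqrt(N). Hyperparameters
# (d, width multiplier, lr, step count) are illustrative, not from the paper.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

N, d = 1024, 16                          # number of points, input dimension
width = 4 * int(math.isqrt(N))           # hidden width ~ sqrt(N)

X = torch.randn(N, d)                    # "most datasets" regime: generic inputs
y = torch.randint(0, 2, (N,)).float()    # arbitrary binary labels

# 3-layer ReLU network: two hidden layers of the chosen width.
model = nn.Sequential(
    nn.Linear(d, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 1),
)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(5000):                 # full-batch gradient steps
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(1), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = (model(X).squeeze(1) > 0).float()
    acc = (preds == y).float().mean().item()
print(f"training accuracy (fraction of points memorized): {acc:.3f}")
```

With these settings the network has only a few tens of thousands of parameters yet typically drives training accuracy to 1.0, illustrating (but not proving) the sub-linear width regime the paper analyzes.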