What do larger image classifiers memorise?

Michal Lukasik; Vaishnavh Nagarajan; Ankit Singh Rawat; Aditya Krishna Menon; Sanjiv Kumar

Published: 23 Apr 2024, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0

Abstract: The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (“memorise”) completely random labels. To carefully study this issue, Feldman (2019) proposed a metric to quantify the degree of memorisation of individual training examples, and empirically computed the corresponding memorisation profile of a ResNet on image classification benchmarks. While an exciting first glimpse into what real-world models memorise, this leaves open a fundamental question: do larger neural models memorise more? This question aligns with the common practice of training models of different sizes, each offering a different cost-quality trade-off: while larger models are typically observed to have higher quality, it is of interest to understand whether this is merely a consequence of them memorising larger numbers of input-output patterns. We present a comprehensive empirical analysis of this question on image classification benchmarks. We find that training examples exhibit an unexpectedly diverse set of memorisation trajectories across model sizes: most samples experience decreased memorisation under larger models, while the rest exhibit cap-shaped or increasing memorisation. We show that various proxies for the Feldman (2019) memorisation score fail to capture these fundamental trends. Lastly, we find that knowledge distillation, an effective and popular model compression technique, tends to inhibit memorisation while also improving generalisation. Specifically, memorisation is mostly inhibited on examples with increasing memorisation trajectories, which points to how distillation improves generalisation.
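For readers unfamiliar with the Feldman (2019) memorisation score mentioned in the abstract: for a training example (x_i, y_i), it is the probability that a model trained with that example predicts y_i on x_i, minus the same probability for a model trained without it. The sketch below is only an illustration of how such a score could be estimated from many models trained on random subsets of the data; it is not the authors' code. The function name `memorisation_scores`, its inputs, and the toy data are assumptions made for this example.

```python
def memorisation_scores(correct, included):
    """Estimate a per-example memorisation score from subsampled training runs.

    correct[k][i]  -- True if model k predicts the label of example i correctly.
    included[k][i] -- True if example i was in the training subset of model k.
    Both inputs are hypothetical bookkeeping tables, one row per trained model.
    Returns one score per example: accuracy over models that saw the example
    minus accuracy over models that did not. None is returned for an example
    that was always included or always excluded, since one term is undefined.
    """
    n_examples = len(correct[0])
    scores = []
    for i in range(n_examples):
        in_hits = [c[i] for c, m in zip(correct, included) if m[i]]
        out_hits = [c[i] for c, m in zip(correct, included) if not m[i]]
        if not in_hits or not out_hits:
            scores.append(None)
            continue
        acc_in = sum(in_hits) / len(in_hits)
        acc_out = sum(out_hits) / len(out_hits)
        scores.append(acc_in - acc_out)
    return scores


# Toy data (invented for illustration): 4 models, 2 training examples.
correct = [
    [True, True],
    [True, True],
    [False, True],
    [False, True],
]
included = [
    [True, True],
    [True, False],
    [False, True],
    [False, False],
]
scores = memorisation_scores(correct, included)
print(scores)
```

In this toy setting the first example is only classified correctly by models that trained on it, so its score is high; the second is classified correctly either way, so its score is zero. Comparing such scores across model sizes is one way to picture the per-example trajectories the abstract describes.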

Submission Length: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Gabriel_Loaiza-Ganem1

Submission Number: 2120
