\textit{Memorization} is the ability of deep models to learn arbitrary training inputs verbatim. One of the most popular means of computing memorization scores (i.e., the probability that a point is memorized) is the pseudo Leave-One-Out (pLOO) method proposed by~\citet{feldman2020longtail}. However, this technique suffers from two shortcomings: it is computationally prohibitive (as it requires training thousands of models) and it produces inaccurate scores. The goal of this work is to overcome both limitations simultaneously. To do so, we take the following approach: \textbf{First}, we demonstrate that the main source of pLOO's computational bottleneck is that it is executed on the entire dataset rather than only on the memorized points. Running pLOO on all points is unnecessary, since most of them are not memorized at all. \textbf{Second}, we develop a simple proxy that identifies the memorized points without having to run pLOO in the first place. To do so, we study the model's training cycle and find that memorized points are learned towards the last iterations. We build a simple proxy based on this observation and find that it \textit{a)} is strongly correlated with the actual memorization scores (Pearson correlation $<-0.95$) across all our models and datasets, and \textit{b)} requires only a single model (instead of the thousands needed by pLOO). However, the proxy does not provide exact memorization scores. \textbf{Third}, to compute these, we incorporate our proxy into the pLOO method, resulting in pLOO\textsubscript{\textit{improved}}. We show that pLOO\textsubscript{\textit{improved}} reduces both the computational overhead (by over 90\%) and the error in the approximated memorization scores (by over 65\%). Our work therefore makes it possible to study memorization in large datasets and real-world models while requiring only a fraction of the computational resources.
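For context, the quantity that pLOO approximates is the leave-one-out memorization score of \citet{feldman2020longtail}; in standard notation, with $\mathcal{A}$ the (randomized) training algorithm, $S$ the training set, and $(x_i, y_i)$ the $i$-th training example,
\[
\mathrm{mem}(\mathcal{A}, S, i) \;=\; \Pr_{h \sim \mathcal{A}(S)}\big[h(x_i) = y_i\big] \;-\; \Pr_{h \sim \mathcal{A}(S \setminus \{(x_i, y_i)\})}\big[h(x_i) = y_i\big],
\]
where both probabilities are estimated empirically by training many models on random subsets of $S$, which is what makes the procedure expensive.

The abstract does not spell out the proxy itself; the following is a minimal sketch of one plausible proxy consistent with the stated observation that memorized points are learned towards the last iterations. The function names \texttt{learning\_epoch} and \texttt{likely\_memorized}, the per-epoch correctness matrix, and the quantile threshold are illustrative assumptions, not the paper's exact formulation.

\begin{verbatim}
import numpy as np

def learning_epoch(correct_history):
    """Assumed input: a bool array of shape (num_epochs, num_examples),
    where correct_history[e, i] is True if example i is classified
    correctly at the end of epoch e.

    Returns, for each example, the first epoch from which it stays
    correctly classified until the end of training (num_epochs if it
    is never stably learned)."""
    num_epochs, _ = correct_history.shape
    # suffix[e, i] is nonzero iff example i is correct at every epoch >= e
    suffix = np.flip(np.cumprod(np.flip(correct_history, axis=0), axis=0),
                     axis=0)
    return np.where(suffix.any(axis=0), suffix.argmax(axis=0), num_epochs)

def likely_memorized(correct_history, quantile=0.9):
    """Flag the examples learned latest in training as candidates for
    memorization; only these would then be passed to the expensive
    leave-one-out estimation (the quantile cutoff is an assumption)."""
    epochs = learning_epoch(correct_history)
    return epochs >= np.quantile(epochs, quantile)
\end{verbatim}

Under this reading, pLOO\textsubscript{\textit{improved}} would run the model-training-heavy leave-one-out estimation only on the flagged candidates, which is where the reported savings in computation and estimation error would come from.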