Abstract: Deep learning has become the de facto approach for nearly all learning tasks.
It has been observed that deep models tend to memorize, and sometimes overfit, the training data, which can compromise performance, privacy, and other critical metrics.
In this paper, we explore the theoretical foundations that connect memorization to sample loss, focusing on learning dynamics to understand what and how deep models memorize.
To this end, we introduce a novel proxy for memorization: Cumulative Sample Loss (CSL).
CSL represents the accumulated loss of a sample throughout the training process.
CSL exhibits remarkable similarity to stability-based memorization, as evidenced by high cosine similarity scores. We delve into the theory behind these results, demonstrating that low CSL yields nontrivial bounds on the extent of stability-based memorization and on learning time.
The proposed proxy, CSL, is four orders of magnitude less computationally expensive than the stability-based method and can be obtained with zero additional overhead during training.
We demonstrate the practical utility of the proposed proxy in identifying mislabeled samples and detecting duplicates, where our metric achieves state-of-the-art performance.
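Below is a minimal sketch, not taken from the paper's repository, of how a cumulative per-sample loss could be tracked within a standard PyTorch training loop; the function name, the assumption that the data loader also yields sample indices, and the cross-entropy objective are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_with_csl(model, loader, optimizer, num_epochs, num_samples, device="cpu"):
    # csl[i] accumulates the loss of sample i over every training step it appears in,
    # reusing losses the loop already computes, so tracking adds negligible overhead.
    csl = torch.zeros(num_samples)
    model.to(device)
    for _ in range(num_epochs):
        for inputs, targets, sample_idx in loader:  # assumes loader yields sample indices
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)
            per_sample_loss = F.cross_entropy(logits, targets, reduction="none")
            csl[sample_idx] += per_sample_loss.detach().cpu()  # accumulate before averaging
            per_sample_loss.mean().backward()
            optimizer.step()
            optimizer.zero_grad()
    return csl  # higher values are expected to correlate with stability-based memorization
```

The only change relative to an ordinary loop is computing the loss with `reduction="none"` and indexing the accumulator by sample, which is why such a proxy can be obtained essentially for free during training.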
Lay Summary: Deep learning models, popular for their effectiveness in many applications like image and text processing, have a notable drawback: they often memorize the training data. This memorization can hurt their ability to perform well on new, unseen data, create privacy issues, and make them vulnerable to certain attacks. To tackle this, our research introduces a new way to measure memorization called Cumulative Sample Loss (CSL).
CSL works by tracking how much each sample contributes to the model's loss (or errors) throughout training. Interestingly, we discovered that samples with higher cumulative losses are more likely to be memorized by the model. Our CSL method is efficient: it can be calculated during training without extra computational cost, making it much faster than existing techniques.
We validated CSL through experiments showing strong correlations with previous memorization metrics. CSL also excelled in practical applications, such as detecting mislabeled or duplicated data within datasets, significantly outperforming other approaches. By providing an efficient and effective way to measure memorization, CSL helps researchers build better, safer, and more reliable machine learning models.
Link To Code: https://github.com/DeepakTatachar/CSL-Mem
Primary Area: Deep Learning->Everything Else
Keywords: Memorization, Learning Time
Submission Number: 2165