To Memorize or Not to Memorize: An Analysis of Supervised Fine-Tuning in Large Language Models

18 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Memorization, Interpretation, Privacy
TL;DR: Memorization in LLMs, at the i-th example in the dataset and the j-th token in the sequence, can be cleanly decomposed and accurately predicted.
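One way to formalize the per-example, per-token view in the TL;DR (our notation, not necessarily the paper's) is a verbatim-match indicator averaged over the sequence:

M_{ij} = \mathbb{1}\!\left[\operatorname*{arg\,max}_{v}\, p_\theta\!\left(v \mid x^{(i)}, y^{(i)}_{<j}\right) = y^{(i)}_{j}\right], \qquad \mathrm{Recall}(i) = \frac{1}{T_i}\sum_{j=1}^{T_i} M_{ij},

where x^{(i)} is the i-th training input, y^{(i)} its target of length T_i, and p_\theta the fine-tuned model.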
Abstract: Supervised fine-tuning (SFT) is a cornerstone technique for adapting large language models (LLMs) to specific domains and tasks. However, its propensity to induce verbatim memorization of training data poses significant risks to safety, privacy, and generalization. This paper presents an empirical analysis of the mechanisms underlying memorization in LLMs during SFT. Our findings confirm that SFT is a direct driver of memorization, with a clear positive correlation between the number of training epochs and the rate of verbatim data recall. The characteristics of the fine-tuning dataset are a critical determinant of memorization. We demonstrate that models trained on broad, open-domain datasets exhibit substantially more memorization than those trained on narrow, domain-specific ones, highlighting a crucial trade-off between model versatility and data containment. Furthermore, we show that verbatim memorization is suppressed when the training data includes highly similar inputs paired with dissimilar outputs. We posit that this phenomenon is not a desirable mitigation strategy but rather a symptom of the model being exposed to conflicting data signals. These findings underscore the complex trade-offs in SFT and stress the importance of understanding these underlying dynamics to develop LLMs that are both capable and secure.
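A minimal sketch of the verbatim-recall metric the abstract describes: prompt the fine-tuned model with a prefix of each training example and check whether greedy decoding reproduces the training continuation exactly. The model name, prefix length, and the examples list are illustrative assumptions, not the paper's actual setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in checkpoint; the paper's SFT models are not public
PREFIX_TOKENS = 32   # assumed prompt length for the prefix/continuation split

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

def verbatim_recall(examples: list[str]) -> float:
    """Fraction of training examples whose continuation is reproduced verbatim."""
    hits = 0
    for text in examples:
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        if len(ids) <= PREFIX_TOKENS:
            continue  # too short to split into prefix + continuation
        prefix, target = ids[:PREFIX_TOKENS], ids[PREFIX_TOKENS:]
        with torch.no_grad():
            out = model.generate(
                prefix.unsqueeze(0),
                max_new_tokens=len(target),
                do_sample=False,  # greedy decoding, as in verbatim-recall tests
                pad_token_id=tokenizer.eos_token_id,
            )
        generated = out[0, PREFIX_TOKENS:]  # strip the echoed prompt
        hits += int(torch.equal(generated, target))
    return hits / len(examples)

Tracking this rate on a fixed sample of the fine-tuning data after each SFT epoch would expose the reported positive correlation between epochs and verbatim recall.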
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 10019