The Role of Learning and Memorization in Relabeling-based Unlearning for LLMs

ICLR 2026 Conference Submission 19447 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Unlearning, LLM, Statistical Learning, AI safety
TL;DR: We study how the nature of response generation (learning-based versus memorization-based) affects unlearning efficiency for the relabeling-based method.
Abstract: This work studies how the nature of a response generated by a large language model (LLM) impacts the efficiency of relabeling-based unlearning, a common unlearning technique that trains the model to fit an "unlearn" set (i.e., a dataset we wish the model to unlearn) with alternative responses, preventing it from generating unwanted outputs that align with the unlearn set. We distinguish between two ways LLMs can generate undesirable outputs: learning-based generation, where the model learns an underlying rule connecting the input and the response (e.g., social stereotypes), and memorization-based generation, where the model memorizes specific information about a given input (e.g., private information such as a phone number). We demonstrate that relabeling-based unlearning can be detrimental to model performance when undesirable outputs arise from learning-based generation, whereas it is more effective under memorization-based generation. We provide theoretical justification for this through the lens of hypothesis testing, showing that memorization-based hypotheses are more stable in the presence of fabricated evidence that contradicts the hypothesis's prediction and more flexible in producing alternative responses. Our empirical results further support these findings by showing a clear performance gap in relabeling-based unlearning under the two types of data-generation mechanisms.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19447