REMIND: Memorization and Unlearning in LLMs Through the Lens of Input Loss Landscapes

ACL ARR 2026 January Submission 3865 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: unlearning evaluation, large language models, input loss landscape, memorization, model interpretability
Abstract: Understanding how large language models (LLMs) store, retain, and remove knowledge is critical for interpretability, reliability, and privacy compliance. We reveal a key phenomenon: machine unlearning leaves distinct geometric signatures in a model's input loss landscape (ILL). Unlearned examples form flat, low-curvature plateaus, in contrast to the sharp, high-curvature basins of retained or unseen examples, even when their pointwise losses overlap; residual memorization is thus exposed through input-output behavior alone. Building on this observation, we introduce REMIND (Residual Memorization in Neighborhood Dynamics), a framework that diagnoses an example's memorization state (retained, forgotten, or holdout) by probing local ILL curvature over semantically coherent neighborhoods. REMIND requires only loss queries and uses a novel embedding-proximity perturbation method to generate controlled, interpretable input variants. It achieves 82% multi-class ROC-AUC in aggregate evaluations, outperforming baselines such as ROUGE-L and MIN-K%++ with roughly 2x higher AUC at 1% FPR, and remains robust on paraphrased inputs. This neighborhood-level geometric analysis offers a practical, interpretable lens on LLM knowledge retention and unlearning, detecting subtle residual signals that pointwise or aggregated metrics miss.
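The probing mechanism described in the abstract (loss queries over embedding-proximate input perturbations to estimate local ILL curvature) can be illustrated with a minimal sketch. This assumes a Hugging Face causal LM with a tokenizer; the helper names `query_loss`, `embedding_neighbors`, and `ill_flatness`, the choice of cosine similarity, and the mean-absolute-loss-change statistic used as a curvature proxy are all illustrative assumptions, not the paper's exact REMIND procedure.

```python
import torch
import torch.nn.functional as F

def query_loss(model, tokenizer, text):
    """Language-modeling loss of `text` under `model` (teacher forcing)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

def embedding_neighbors(model, token_id, k=5):
    """k nearest tokens to `token_id` in input-embedding space (cosine)."""
    emb = model.get_input_embeddings().weight  # (vocab_size, dim)
    sims = F.cosine_similarity(emb[token_id].unsqueeze(0), emb, dim=-1)
    sims[token_id] = -1.0  # exclude the token itself
    return sims.topk(k).indices.tolist()

def ill_flatness(model, tokenizer, text, k=5):
    """
    Probe the input loss landscape around `text`: replace each token with its
    embedding-space neighbors, re-query the loss, and summarize how much the
    loss moves. A small mean change suggests a flat plateau (candidate
    unlearned example); a large change suggests a sharp basin (retained or
    holdout example).
    """
    base = query_loss(model, tokenizer, text)
    ids = tokenizer(text)["input_ids"]
    deltas = []
    for pos, tid in enumerate(ids):
        for nb in embedding_neighbors(model, tid, k):
            variant = tokenizer.decode(ids[:pos] + [nb] + ids[pos + 1:])
            deltas.append(abs(query_loss(model, tokenizer, variant) - base))
    return base, sum(deltas) / len(deltas)  # (pointwise loss, curvature proxy)
```

In this sketch, substituting embedding-nearest tokens stands in for the paper's "semantically coherent neighborhoods": each variant stays close to the original input, so the spread of losses over variants reflects local landscape geometry rather than distance from the example itself.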
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: interpretability, semantic textual similarity, knowledge tracing, model editing, evaluation methodologies, adversarial attacks
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 3865