Abstract: Pretrained large language models (LLMs) have become foundational tools in natural language processing (NLP), demonstrating strong performance across tasks such as summarization, question answering, and translation. However, their internal memorization mechanisms remain difficult to interpret and control. This challenge arises from the distributed and nonlinear nature of memorization in LLMs, where learned information—such as specific phrases or facts—is entangled across billions of parameters. As a result, identifying how and when memorized content is retrieved during inference remains an open problem.
In this work, we propose a novel framework to uncover the relationship between input semantics and memorization in LLMs. We insert a Sparse Autoencoder (SAE) at the final hidden layer to decompose high-dimensional activations into sparse, interpretable components. To further investigate how specific input features influence memorization, we introduce Representation Fine-Tuning (REFT), a mechanism that dynamically modulates the SAE-encoded representations based on semantic interventions. Experimental results on the GPT-Neo and Pythia model families show that our method consistently outperforms both no-prompt and prompt-based baselines in memory retrieval tasks. Moreover, we demonstrate that our framework enables fine-grained analysis of how semantic variations in input tokens affect memorization behavior.
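To make the pipeline described above concrete, the sketch below wires a small sparse autoencoder to the final hidden layer of a GPT-Neo checkpoint and applies a simple feature-level intervention on the encoded representation before reconstruction. This is a minimal illustration under stated assumptions: the untied linear SAE with ReLU sparsity, the `intervene` helper, the GPT-Neo 125M checkpoint, and the latent-scaling scheme are placeholders for exposition, not the paper's exact REFT mechanism or experimental setup.

```python
# Minimal sketch (assumptions noted in comments): decompose final-layer activations
# with a sparse autoencoder and modulate selected latent features before reconstruction.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM


class SparseAutoencoder(nn.Module):
    """Overcomplete SAE: encode hidden states into sparse latents, then reconstruct."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(h))  # ReLU yields non-negative, sparse codes

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


def intervene(z: torch.Tensor, feature_ids: list[int], scale: float) -> torch.Tensor:
    """Illustrative REFT-style modulation: rescale chosen latent features."""
    z = z.clone()
    z[..., feature_ids] *= scale
    return z


# Model choice is an assumption for illustration; any GPT-Neo / Pythia checkpoint works.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
model.eval()

sae = SparseAutoencoder(d_model=model.config.hidden_size,
                        d_latent=8 * model.config.hidden_size)

inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    h_final = out.hidden_states[-1]                       # final hidden layer: (batch, seq, d_model)
    z = sae.encode(h_final)                               # sparse, interpretable components
    z_mod = intervene(z, feature_ids=[0, 1], scale=0.0)   # e.g. ablate two latent features
    h_recon = sae.decode(z_mod)                           # modulated representation for analysis
```

In a full setup, the SAE would be trained on the model's final-layer activations with a sparsity penalty, and the intervention would target latent features identified as semantically meaningful rather than arbitrary indices as shown here.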
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large language models (LLMs), memorization mechanisms
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese
Submission Number: 2599