Abstract: Pretrained large language models (LLMs) have become foundational tools in natural language processing (NLP), demonstrating strong performance across tasks such as summarization, question answering, and translation. However, their internal memorization mechanisms remain difficult to interpret and control. This challenge arises from the distributed and nonlinear nature of memorization in LLMs, where learned information—such as specific phrases or facts—is entangled across billions of parameters. As a result, identifying how and when memorized content is retrieved during inference remains an open problem.
In this work, we propose a novel framework to uncover the relationship between input semantics and memorization in LLMs. We insert a Sparse Autoencoder (SAE) at the final hidden layer to decompose high-dimensional activations into sparse, interpretable components. To further investigate how specific input features influence memorization, we introduce Representation Fine-Tuning (REFT), a mechanism that dynamically modulates the SAE-encoded representations based on semantic interventions. Experimental results on the GPT-Neo and Pythia model families show that our method consistently outperforms both no-prompt and prompt-based baselines in memory retrieval tasks. Moreover, we demonstrate that our framework enables fine-grained analysis of how semantic variations in input tokens affect memorization behavior.
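To make the pipeline described above concrete, the sketch below wires a small sparse autoencoder to the final hidden layer of a GPT-Neo checkpoint and applies a simple feature-level intervention on the encoded representation before reconstruction. This is a minimal illustration under stated assumptions: the untied linear SAE with ReLU sparsity, the `intervene` helper, the GPT-Neo 125M checkpoint, and the latent-scaling scheme are placeholders for exposition, not the paper's exact REFT mechanism or experimental setup.

```python
# Minimal sketch (assumptions noted in comments): decompose final-layer activations
# with a sparse autoencoder and modulate selected latent features before reconstruction.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM


class SparseAutoencoder(nn.Module):
    """Overcomplete SAE: encode hidden states into sparse latents, then reconstruct."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(h))  # ReLU yields non-negative, sparse codes

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


def intervene(z: torch.Tensor, feature_ids: list[int], scale: float) -> torch.Tensor:
    """Illustrative REFT-style modulation: rescale chosen latent features."""
    z = z.clone()
    z[..., feature_ids] *= scale
    return z


# Model choice is an assumption for illustration; any GPT-Neo / Pythia checkpoint works.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
model.eval()

sae = SparseAutoencoder(d_model=model.config.hidden_size,
                        d_latent=8 * model.config.hidden_size)

inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    h_final = out.hidden_states[-1]                       # final hidden layer: (batch, seq, d_model)
    z = sae.encode(h_final)                               # sparse, interpretable components
    z_mod = intervene(z, feature_ids=[0, 1], scale=0.0)   # e.g. ablate two latent features
    h_recon = sae.decode(z_mod)                           # modulated representation for analysis
```

In a full setup, the SAE would be trained on the model's final-layer activations with a sparsity penalty, and the intervention would target latent features identified as semantically meaningful rather than arbitrary indices as shown here.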
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large language models (LLMs), memorization mechanisms
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese
Submission Number: 2599