MemLens: Uncovering Memorization in LLMs with Activation Trajectories

17 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: LLM, Interpretability, Memorization
Abstract: Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and risk being memorized. Contamination can be explicit, where samples appear verbatim in the training data, or implicit, where samples are rephrased, perturbed, or translated yet still memorized. Existing detection baselines focus on surface-level lexical overlap and perplexity; they perform well on explicit contamination but degrade significantly in implicit cases. We propose MemLens (an Activation Lens for Memorization Detection), which detects memorization by analyzing the probability trajectories of numeric tokens during generation. We observe that contaminated and clean samples exhibit distinct, well-separated reasoning trajectories. To further validate this observation, we inject carefully designed samples into the model via LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
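MemLens itself is not released with this page, so the following is only a minimal sketch of one plausible reading of "probability trajectories of numeric tokens": a logit-lens-style probe that projects each layer's hidden state through the model's final norm and LM head, and records the probability mass assigned to digit tokens at every layer. The model name, the digit-token heuristic, and the choice of the last prompt position are all illustrative assumptions, not the authors' method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any HF causal LM with a tied LM head would do.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

# Heuristic: "numeric tokens" = vocabulary entries that decode to digits.
numeric_ids = torch.tensor(
    [i for i in range(len(tok)) if tok.decode([i]).strip().isdigit()]
)

@torch.no_grad()
def numeric_trajectory(prompt: str) -> list[float]:
    """Per-layer probability mass on numeric tokens at the last position.

    Logit-lens style: each layer's hidden state is passed through the
    final LayerNorm and LM head (an assumption about what MemLens measures).
    """
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    traj = []
    for h in out.hidden_states:  # embedding output + one entry per layer
        h_last = model.transformer.ln_f(h[:, -1, :])  # GPT-2 final norm
        probs = torch.softmax(model.lm_head(h_last), dim=-1)
        traj.append(probs[0, numeric_ids].sum().item())
    return traj

# Hypothesis from the abstract: a memorized (contaminated) sample commits
# to the answer's digits in shallower layers, so its curve rises earlier
# than a clean sample's.
print(numeric_trajectory("Q: What is 17 * 23? A: "))
```

Under this reading, separating contaminated from clean samples would amount to comparing these per-layer curves (e.g., by where they first exceed a threshold), which is consistent with, but not confirmed by, the abstract's description.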
Primary Area: interpretability and explainable AI
Submission Number: 9758