On the Fragility of Latent Knowledge: Layer-wise Influence under Unlearning in Large Language Models

Published: 11 Jun 2025, Last Modified: 11 Jun 2025, MUGen @ ICML 2025 Poster, CC BY 4.0
Keywords: Large Language Model Unlearning
Abstract: Large language model (LLM) unlearning has emerged as an essential post-training mechanism for erasing specific knowledge or undesirable behaviors. However, forgetting target data often causes an unintended degradation in overall model utility. Although various advanced methods have explored different learning objectives to mitigate this trade-off, it remains unclear how the highly entangled internal representations of LLMs contribute to unlearning. In this work, we introduce the notion of *latent knowledge fragility* to characterize the vulnerability of retained knowledge to unlearning. We develop a unified analytical approach based on component-wise parameter patching that isolates and quantifies fragility at the level of individual transformer blocks. We observe that the LLM encodes different levels of abstraction, from surface syntax in shallow layers to complex semantics in deeper layers, which align with different degrees of representation disruption and utility degradation. Building on these insights, we propose a lightweight framework called *Component-wise Replacement Unlearning* (CRU) that restores fragile layers (and, by extension, other components) from the original model based on post-hoc validation, yielding a hybrid model without any additional training. Extensive experiments across multiple settings verify that our method consistently improves the trade-off between removal and retention. Our analysis highlights the non-uniform influence of different LLM layers and opens a new avenue for surgical unlearning.
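For illustration, the core replacement step described in the abstract (copying "fragile" transformer blocks back from the original model into the unlearned model to form a hybrid) can be sketched as below. This is a minimal sketch assuming HuggingFace-style causal-LM checkpoints with LLaMA-style parameter names; the model identifiers, the parameter-name prefix, and the set of fragile layer indices are placeholders rather than the paper's released configuration.

```python
# Component-wise replacement sketch: restore selected transformer blocks of an
# unlearned model from the original (pre-unlearning) model, without retraining.
from transformers import AutoModelForCausalLM

# Placeholder checkpoint names, not the paper's actual models.
original = AutoModelForCausalLM.from_pretrained("original-model")
unlearned = AutoModelForCausalLM.from_pretrained("unlearned-model")

# Layer indices flagged as fragile by post-hoc validation (illustrative values).
fragile_layers = {0, 1, 2}

orig_state = original.state_dict()
hybrid_state = unlearned.state_dict()

for name, param in orig_state.items():
    # Parameter names of the form "model.layers.<i>." follow LLaMA-style
    # checkpoints; adapt the prefix to the architecture at hand.
    if any(f"model.layers.{i}." in name for i in fragile_layers):
        hybrid_state[name] = param.clone()

# Load the mixed weights back: unlearned parameters everywhere except the
# restored fragile blocks, yielding the hybrid model.
unlearned.load_state_dict(hybrid_state)
unlearned.save_pretrained("hybrid-model")
```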
Submission Number: 24