An unintended consequence of the vast pretraining of Large Language Models (LLMs) is the verbatim memorization of fragments of their training data, which may contain sensitive or copyrighted information. In recent years, unlearning has emerged as a solution to effectively remove sensitive knowledge from models after training. Yet, recent work has shown that supposedly deleted information can still be extracted by malicious actors through various attacks. However, such attacks retrieve sets of possible candidate generations and are unable to pinpoint the output that contains the actual target information. We propose activation steering as a method for exact information retrieval from unlearned LLMs. We introduce a novel approach to generating steering vectors, named Anonymized Activation Steering. Additionally, we develop a simple word frequency method to pinpoint the correct answer among a set of candidates when retrieving unlearned information. Our evaluation across multiple unlearning techniques and datasets demonstrates that activation steering successfully recovers general knowledge (e.g., widely known fictional characters) while revealing limitations in retrieving specific information (e.g., details about non-public individuals). Overall, our results demonstrate that exact information retrieval from unlearned models is possible, highlighting a severe vulnerability of current unlearning techniques.
Paper points out a severe evaluation flaw in unlearning works
In this paper, the authors present an attack that extracts supposedly "unlearned" information from LLMs after unlearning has been applied.
To achieve this, they rely on the concept of activation steering to guide activations in a certain direction during the LLM inference stage.
In particular, they determine the steering direction by contrasting the activations of the original (unlearning-targeted) prompts with those of anonymized prompts, and then apply this direction with a scale factor during inference.
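For intuition, the following is a minimal sketch of this kind of contrastive activation steering; it is not the authors' implementation, and the model name, steering layer, scale factor, and example prompts are purely illustrative assumptions.

```python
# Minimal sketch of contrastive activation steering (not the authors' exact method).
# Assumptions: the model, layer index, scale factor, and prompts below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model; any decoder-only LM works
LAYER = 6             # hypothetical layer at which to steer
ALPHA = 4.0           # hypothetical steering scale factor

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def hidden_at_layer(prompt: str) -> torch.Tensor:
    """Mean hidden state of the prompt after the chosen block."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Steering direction = activations(original prompt) - activations(anonymized prompt)
original   = "Who is Harry Potter's best friend?"   # illustrative unlearned prompt
anonymized = "Who is [PERSON]'s best friend?"       # illustrative anonymized prompt
direction = hidden_at_layer(original) - hidden_at_layer(anonymized)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # The block output is either a tensor or a tuple whose first item is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok(original, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=30, do_sample=True, temperature=2.0)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Sampling with a nonzero temperature and repeating the steered generation yields the set of candidate answers from which the correct one must then be identified.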
The biggest strength of the paper is that it demonstrates that many unlearning techniques still retain knowledge about the unlearned content. This highlights the need for a more robust assessment of unlearning techniques.
Overall, the paper’s findings are important to the LLM unlearning community.
Simple and fun
Summary:
This paper presents a method to exploit vulnerabilities in the unlearning process of Large Language Models (LLMs), highlighting the weaknesses of current unlearning techniques. The proposed approach, called Activation Steering, enables precise extraction of information that was supposedly unlearned. By manipulating activation patterns, the authors demonstrate that even when information is removed from a model, it can still be recovered with high accuracy under certain conditions. The study reveals that current unlearning methods are inadequate, as they do not fully eliminate sensitive knowledge, leaving models susceptible to targeted attacks. This work underscores the need for more robust unlearning strategies to prevent unauthorized data retrieval.
Strengths:
Novelty: The introduction of activation steering as a technique to retrieve unlearned information is innovative. It advances beyond existing methods by focusing on exact information retrieval, rather than generating sets of possible candidates.
Simplicity: The method is straightforward, relying on simple activation manipulations and prompt anonymization. This simplicity makes it relatively easy to implement and understand, which enhances its practical applicability.
Direct Identification of Unlearning Issues: The paper clearly pinpoints the shortcomings of existing unlearning techniques, showing that current methods fail to completely eliminate sensitive data from LLMs.
Weaknesses:
- Subjective Evaluation Criteria: The use of keyword frequencies as a proxy for answer correctness could introduce bias and may not fully capture the semantic accuracy of responses (an illustrative version of such a heuristic is sketched after this list).
- Limited Task Coverage: The paper focuses primarily on unlearning scenarios without exploring how the method might perform on a broader range of tasks. Expanding the experiments to include different types of tasks would provide a more comprehensive evaluation of the method's capabilities.
- Unclear Process Description: The description of the method's process, especially the details of generating steering vectors and the anonymization strategy, is somewhat unclear. More clarity and step-by-step explanation are needed to ensure that readers can easily understand and reproduce the technique.
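For concreteness, below is an illustrative sketch of a word-frequency selection heuristic over candidate generations; the paper's exact scoring rule is not reproduced here, and this consensus-style scoring is only an assumption.

```python
# Hedged sketch of a word-frequency heuristic for picking one answer among
# candidate generations; the paper's exact criterion may differ.
from collections import Counter
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def pick_candidate(candidates: list[str]) -> str:
    """Score each candidate by how often its words occur across all candidates,
    and return the highest-scoring one (answers that recur reinforce each other)."""
    corpus_counts = Counter(w for c in candidates for w in tokenize(c))

    def score(c: str) -> float:
        words = tokenize(c)
        return sum(corpus_counts[w] for w in words) / max(len(words), 1)

    return max(candidates, key=score)

# Illustrative steered generations for the same unlearned query.
candidates = [
    "Ron Weasley is his closest friend.",
    "His best friend is Ron Weasley.",
    "He is best friends with a dragon.",
]
print(pick_candidate(candidates))  # the recurring "Ron Weasley" answer wins
```

As the reviewer notes above, such a frequency-based proxy rewards agreement among candidates rather than semantic correctness, which is why the evaluation criterion is listed as a weakness.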
Questions:
Why was the temperature set to 2 in the experiments? Do different temperatures affect the effectiveness of the attack?
Recommendation:
Based on the simplicity of the approach and its effectiveness in clearly exposing the weaknesses of current unlearning techniques, I recommend acceptance. Addressing the issues of evaluation methodology, task diversity, and clarity of the method description could further strengthen the impact of this research.