Keywords: Machine Unlearning, Contrastive Learning, Representation-based
Abstract: To remove a designated set of undesirable knowledge from Large Language Models (LLMs), various unlearning approaches have been proposed. Existing approaches typically define the target knowledge through fixed textual expressions and then prevent the model from reproducing it in those specific forms. Because they address only the textual surface of the undesirable knowledge, the resulting "forgetting" is brittle: unlearnt models may still recover the same knowledge when the expressions are paraphrased or otherwise altered. This research therefore revisits unlearning in the realistic and failure-prone setting of identifier–attribute (IA) knowledge, where undesirable knowledge cannot be fully captured by fixed expressions. We formalize knowledge extraction under relaxed elicitation conditions by marginalizing over the hidden distribution of query expression strategies, which reframes unlearning as minimizing extraction risk over expression variability.
Since sampling over latent prompts is infeasible, we instead propose ConRep, a representation-based approach that enforces the invariants implied by the distributional formulation: representations of retained knowledge remain stable and surface-invariant, while representations of forgotten knowledge are repelled and dispersed toward low-information regions of the model's representation space. To evaluate unlearning trustworthily and thoroughly, we build ClinicIA, a benchmark that comprises comprehensive knowledge probing under diverse task formats and spans two representative knowledge-provenance regimes. Across evaluation tasks and regimes, ConRep outperforms prior approaches, achieving robust forgetting while preserving the knowledge the model should retain and its general utility.
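The distributional formulation above can be sketched as follows; the notation here is an illustrative assumption rather than the paper's own:

```latex
% Illustrative sketch (symbols assumed, not taken from the paper):
% let s be a latent query-expression strategy with hidden
% distribution P(s), q_s(k) a query that elicits knowledge item k
% under strategy s, f_\theta the model, and Ext a measure of
% extraction success. Marginalizing over expression variability,
% the extraction risk for item k is
R(\theta; k) \;=\; \mathbb{E}_{s \sim P(s)}
    \Big[\,\mathrm{Ext}\big(f_\theta(q_s(k)),\, k\big)\,\Big].
% Unlearning then seeks to minimize R(\theta; k) over the forget
% set while keeping it high (and utility intact) on the retain set.
```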
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7830