Research Area: Science of LMs
Keywords: unlearning, forgetting, generalization
TL;DR: When language models (LMs) are trained to “unlearn” a skill, this unlearning often fails to generalize outside the specific examples used for fine-tuning.
Abstract: When language models (LMs) are trained to “unlearn” a skill, does this unlearning generalize? We study the behavior of LMs after fine-tuning on data for a target task (e.g., sentiment analysis) in which the labels have been randomized, a popular unlearning method. While LMs consistently learn to generate near-random predictions for individual training examples in the unlearning set, there is extreme variability across tasks in whether LM predictions change on examples outside the unlearning set. In some tasks (like sentiment analysis), unlearning generalizes robustly and causes models to generate random outputs on all sentiment-type inputs; in other tasks (like physical commonsense reasoning and scientific question answering), unlearning produces almost no generalization at all, and models continue to perform the task accurately even on examples very similar to those that appeared in the training set. Across tasks, we find that dataset difficulty is not predictive of whether a behavior can be unlearned; instead, generalization in unlearning is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of unlearning data, with low confidence and low variability both associated with greater generalization. Finally, we show that even generalizable unlearning is shallow: linear probes trained on LMs' representations can still perform tasks reliably after unlearning. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 779