Keywords: Trustworthy LLM; Trustworthy AI; Responsible AI; Large Language Model; AI Privacy; AI Security
TL;DR: We study how to perform unlearning in large language models.
Abstract: We study how to perform unlearning in large language models (LLMs), i.e., forgetting harmful behaviors an LLM acquired during pretraining, or removing the effect of training samples that must be deleted upon user request. This highlights unlearning as an approach to aligning LLMs with human preferences. Compared to the standard RLHF (RL from human feedback) solution for aligning LLMs, unlearning has three benefits. (1) It only requires negative examples, which are cheaper to collect than the high-quality (i.e., positive) examples in RLHF that require human effort. (2) It is less computationally expensive; the cost is comparable to fine-tuning. (3) It is more effective when we know which training samples cause the misbehavior. To the best of our knowledge, our work is the first to explore LLM unlearning and to formulate its settings, goals, and evaluations. Our empirical results suggest unlearning is a promising direction for LLM alignment.
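The abstract notes that unlearning relies only on negative examples and costs about as much as fine-tuning. Below is a minimal sketch of one way such an objective could look: fine-tuning on negative (harmful) samples with the loss sign flipped so their likelihood decreases. The model name, data, optimizer, and hyperparameters are illustrative assumptions and are not specified in the abstract.

```python
# Sketch of gradient-ascent style unlearning on negative examples (assumed setup,
# not the paper's exact method): flip the sign of the language-modeling loss on
# samples to forget, so optimization pushes their likelihood down.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper's choice may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Hypothetical negative examples the model should unlearn.
negative_texts = [
    "Example of a harmful completion to forget.",
    "Another undesirable training sample.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(3):  # cost is comparable to ordinary fine-tuning
    for text in negative_texts:
        batch = tokenizer(text, return_tensors="pt")
        outputs = model(**batch, labels=batch["input_ids"])
        # Negate the loss: minimizing -loss is gradient ascent on the
        # likelihood of the negative sample, i.e., unlearning it.
        loss = -outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```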
Submission Number: 82