Keywords: Unlearning, LLMs, Robustness
Abstract: Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses based on DPO, NPO, and similar algorithms reduce the likelihood of harmful continuations, but not robustly: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in the representations. Motivated by these findings, we propose Representation Erasure-based Preference Optimization (REPO), which builds on SURE (Sepahvand et al., 2025), an unlearning algorithm originally developed for classification. Our core strategy is to preserve the representations of benign (safe, non-toxic) generations while forcing the representations of toxic generations to converge toward their benign counterparts. This alignment is achieved through a coupled objective that combines a retain loss on non-toxic samples with a domain-adversarial loss on both toxic and non-toxic samples, enforced by a gradient reversal layer. Comprehensive evaluations show that REPO not only significantly reduces in-distribution and out-of-distribution toxicity compared to baselines such as DPO, NPO, and RMU, but also achieves best-in-class robustness against sophisticated attacks, including relearning on forget and retain samples and adversarial prompt injection via an enhanced variant of GCG.
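Below is a minimal PyTorch sketch of the coupled objective described in the abstract: a retain loss on non-toxic samples plus a domain-adversarial loss on both domains, with gradients flipped by a gradient reversal layer. The model interface, pooling of hidden states, domain-classifier head, and loss weights are all illustrative assumptions, not the authors' implementation.

```python
# Sketch of a retain + domain-adversarial objective with a gradient reversal layer.
# Assumes `model(input_ids)` returns (token_logits, pooled_hidden); this interface,
# the domain classifier, and the weights lambd/alpha are hypothetical.
import torch
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def coupled_loss(model, domain_clf, retain_batch, forget_batch, lambd=1.0, alpha=1.0):
    # Retain loss: preserve next-token predictions on non-toxic (retain) samples.
    retain_logits, retain_hidden = model(retain_batch["input_ids"])
    retain_loss = F.cross_entropy(
        retain_logits.view(-1, retain_logits.size(-1)),
        retain_batch["labels"].view(-1),
        ignore_index=-100,
    )

    # Domain-adversarial loss: the reversal layer trains the encoder to make toxic
    # and non-toxic representations indistinguishable to the domain classifier.
    _, forget_hidden = model(forget_batch["input_ids"])
    hidden = torch.cat([retain_hidden, forget_hidden], dim=0)
    domain_labels = torch.cat(
        [torch.zeros(retain_hidden.size(0)), torch.ones(forget_hidden.size(0))]
    ).long()
    domain_logits = domain_clf(GradReverse.apply(hidden, lambd))
    domain_loss = F.cross_entropy(domain_logits, domain_labels)

    return retain_loss + alpha * domain_loss
```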
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22552