Keywords: Safety, Large Language Models, Unlearning, Second-Order Optimization
TL;DR: We show that Gauss‑Newton steps, a classic unlearning technique, can be scaled to LLMs, delivering strong performance on LLM unlearning benchmarks by effectively suppressing hazardous outputs and approximating retraining.
Abstract: Large language models (LLMs) can learn to produce sensitive outputs that model deployers may wish to suppress, motivating the use of output suppression (LLM unlearning) methods. We demonstrate that taking only a few uphill Gauss-Newton steps on a forget set provides a conceptually simple, state-of-the-art unlearning algorithm that is underexplored in the LLM literature. We show that these steps can be efficiently and accurately implemented for LLMs using parametric Hessian approximations such as K-FAC. We call this approach **K-FA**C for **D**istribution **E**rasure (K-FADE). Our evaluations demonstrate that K-FADE performs competitively with or better than previous unlearning approaches for LLMs across standard benchmarks. On the TOFU unlearning benchmark, K-FADE approximates the output distribution of models re-finetuned with the forget data excluded. On the WMDP benchmark, K-FADE effectively suppresses outputs from the targeted distribution while minimally altering the model's outputs on non-targeted data.
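To make the core idea concrete, here is a minimal sketch of one "uphill" Gauss-Newton step with a K-FAC-style Kronecker-factored curvature approximation, for a single linear layer on a toy forget batch. This is an illustration under stated assumptions, not the paper's implementation: the layer shapes, loss, damping value, and step size below are all invented for the example.

```python
# Hypothetical sketch: one "uphill" Gauss-Newton step using a K-FAC-style
# Kronecker-factored curvature approximation for a single linear layer.
# Shapes, damping, and learning rate are illustrative assumptions only.
import torch

torch.manual_seed(0)
d_in, d_out, batch = 16, 8, 32
W = torch.randn(d_out, d_in, requires_grad=True)  # layer weight

# A toy "forget" batch: inputs and the (unwanted) target outputs.
x = torch.randn(batch, d_in)
y = torch.randn(batch, d_out)

# Forward pass through the linear layer and a loss on the forget set.
a = x                       # layer input activations
s = a @ W.T                 # pre-activations
loss = torch.nn.functional.mse_loss(s, y)

# Backprop to get the pre-activation gradients and the weight gradient.
(g_s,) = torch.autograd.grad(loss, s)
G = g_s.T @ a               # (d_out, d_in) gradient of loss w.r.t. W

# K-FAC factors: input-activation covariance A and output-gradient
# covariance S, each with a small damping term for invertibility.
damping = 1e-3
A = a.T @ a / batch + damping * torch.eye(d_in)
S = g_s.T @ g_s / batch + damping * torch.eye(d_out)

# Natural-gradient direction: with the Kronecker approximation F ~ A (x) S,
# the identity (A (x) S)^{-1} vec(G) = vec(S^{-1} G A^{-1}) gives the
# preconditioned update without forming the full curvature matrix.
nat_grad = torch.linalg.solve(S, G) @ torch.linalg.inv(A)

# "Uphill" step: ascend the forget-set loss to erase the targeted behavior.
lr = 0.1
with torch.no_grad():
    W += lr * nat_grad      # note the + sign: gradient *ascent* on forget data
```

In a full LLM, the analogous update would be applied per layer with factors estimated over many batches; the single-layer, single-batch version above is only meant to show how the Kronecker factorization keeps the Gauss-Newton preconditioning tractable.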
Submission Number: 34