Keywords: Safety, Large Language Models, Unlearning, Second-Order Optimization
TL;DR: We show that Gauss‑Newton steps, a classic unlearning technique, can be scaled to LLMs, delivering strong performance on LLM unlearning benchmarks by effectively suppressing hazardous outputs and approximating retraining.
Abstract: Large language models (LLMs) can learn to produce sensitive outputs that model deployers may wish to suppress, motivating the use of output suppression (LLM unlearning) methods. We demonstrate that taking only a few uphill Gauss-Newton steps on a forget set provides a conceptually simple, state-of-the-art unlearning algorithm that is underexplored in the LLM literature. We show that these steps can be efficiently and accurately implemented for LLMs using parametric Hessian approximations such as K-FAC. We call this approach **K-FA**C for **D**istribution **E**rasure (K-FADE). Our evaluations demonstrate that K-FADE performs competitively with or better than previous unlearning approaches for LLMs across standard benchmarks. On the TOFU unlearning benchmark, K-FADE approximates the output distribution of models re-finetuned with the forget data excluded. On the WMDP benchmark, K-FADE effectively suppresses outputs from the targeted distribution while minimally altering the model's outputs on non-targeted data.
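To make the core idea concrete, here is a minimal sketch of one "uphill" Gauss-Newton step with a K-FAC-style Kronecker-factored curvature approximation, for a single linear layer on a toy forget batch. This is an illustration under stated assumptions, not the paper's implementation: the layer shapes, loss, damping value, and step size below are all invented for the example.

```python
# Hypothetical sketch: one "uphill" Gauss-Newton step using a K-FAC-style
# Kronecker-factored curvature approximation for a single linear layer.
# Shapes, damping, and learning rate are illustrative assumptions only.
import torch

torch.manual_seed(0)
d_in, d_out, batch = 16, 8, 32
W = torch.randn(d_out, d_in, requires_grad=True)  # layer weight

# A toy "forget" batch: inputs and the (unwanted) target outputs.
x = torch.randn(batch, d_in)
y = torch.randn(batch, d_out)

# Forward pass through the linear layer and a loss on the forget set.
a = x                       # layer input activations
s = a @ W.T                 # pre-activations
loss = torch.nn.functional.mse_loss(s, y)

# Backprop to get the pre-activation gradients and the weight gradient.
(g_s,) = torch.autograd.grad(loss, s)
G = g_s.T @ a               # (d_out, d_in) gradient of loss w.r.t. W

# K-FAC factors: input-activation covariance A and output-gradient
# covariance S, each with a small damping term for invertibility.
damping = 1e-3
A = a.T @ a / batch + damping * torch.eye(d_in)
S = g_s.T @ g_s / batch + damping * torch.eye(d_out)

# Natural-gradient direction: with the Kronecker approximation F ~ A (x) S,
# the identity (A (x) S)^{-1} vec(G) = vec(S^{-1} G A^{-1}) gives the
# preconditioned update without forming the full curvature matrix.
nat_grad = torch.linalg.solve(S, G) @ torch.linalg.inv(A)

# "Uphill" step: ascend the forget-set loss to erase the targeted behavior.
lr = 0.1
with torch.no_grad():
    W += lr * nat_grad      # note the + sign: gradient *ascent* on forget data
```

In a full LLM, the analogous update would be applied per layer with factors estimated over many batches; the single-layer, single-batch version above is only meant to show how the Kronecker factorization keeps the Gauss-Newton preconditioning tractable.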
Submission Number: 34