Keywords: Model Unlearning, Sparse Autoencoder, Subspace Projection
Abstract: Large language models (LLMs) store vast knowledge but pose privacy and safety risks when targeted content must be removed. Existing unlearning approaches, such as gradient-based methods, model editing, and SAE-based techniques, either lack interpretability or remain vulnerable to adversarial prompts. We introduce **S**AE-Guided **S**ubspace **P**rojection **U**nlearning (**SSPU**), which extracts the SAE features most and least correlated with the forget topic to form “relevant” and “irrelevant” subspaces, then optimizes a combined unlearning and regularization loss that guides precise, interpretable updates in parameter space. On WMDP-Cyber and three utility benchmarks (MMLU, TruthfulQA, GSM8K), SSPU reduces harmful knowledge by **3.22%** versus the best baseline and boosts robustness against jailbreak prompts. Our findings expose the limitations of prior unlearning methods and demonstrate how interpretable, subspace-guided optimization can achieve robust, controllable model behavior.
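The abstract outlines two steps: selecting SAE features by their correlation with the forget topic to span relevant/irrelevant subspaces, and regularizing the unlearning update toward those subspaces. Below is a minimal illustrative sketch of one plausible instantiation; the names (`forget_acts`, `retain_acts`, `sae_decoder`, `k`, `lam`) and the exact scoring/regularization choices are assumptions, not the paper's definitive implementation.

```python
# Hypothetical sketch of SAE-guided subspace selection and a combined loss.
# Assumes a pretrained SAE whose decoder rows are per-feature directions in d_model space.
import torch

def build_subspaces(forget_acts: torch.Tensor,   # [n_tokens, n_features] SAE activations on forget-topic text
                    retain_acts: torch.Tensor,   # [n_tokens, n_features] SAE activations on general text
                    sae_decoder: torch.Tensor,   # [n_features, d_model] decoder directions
                    k: int = 32):
    # Score each SAE feature by how much more it activates on the forget topic.
    score = forget_acts.mean(0) - retain_acts.mean(0)
    top = score.topk(k).indices        # most forget-correlated features -> "relevant"
    bottom = (-score).topk(k).indices  # least forget-correlated features -> "irrelevant"

    # Orthonormalize the selected decoder directions to span each subspace.
    relevant = torch.linalg.qr(sae_decoder[top].T).Q      # [d_model, k]
    irrelevant = torch.linalg.qr(sae_decoder[bottom].T).Q  # [d_model, k]
    return relevant, irrelevant

def combined_loss(unlearn_loss: torch.Tensor,
                  delta_w: torch.Tensor,       # candidate weight update, [d_model, ...]
                  irrelevant: torch.Tensor,
                  lam: float = 1.0) -> torch.Tensor:
    # One possible regularizer: penalize the component of the update that
    # falls inside the irrelevant subspace, keeping edits on-topic.
    proj_irr = irrelevant @ (irrelevant.T @ delta_w)
    return unlearn_loss + lam * proj_irr.pow(2).sum()
```

This is only meant to make the subspace idea concrete: feature selection happens in SAE latent space, while the regularizer constrains where in parameter space the unlearning update is allowed to act.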
Submission Number: 17