Keywords: Model Unlearning, Sparse Autoencoder, Subspace Projection
Abstract: Large language models (LLMs) store vast knowledge but pose privacy and safety risks when targeted content must be removed. Existing unlearning approaches, such as gradient-based methods, model editing, and SAE-based techniques, either lack interpretability or remain vulnerable to adversarial prompts. We introduce **S**AE-Guided **S**ubspace **P**rojection **U**nlearning (**SSPU**), which extracts the SAE features most and least correlated with the forget topic to form “relevant” and “irrelevant” subspaces, then optimizes a combined unlearning and regularization loss that guides precise, interpretable updates in parameter space. On WMDP-Cyber and three utility benchmarks (MMLU, TruthfulQA, GSM8K), SSPU reduces harmful knowledge by **3.22%** versus the best baseline and boosts robustness against jailbreak prompts. Our findings expose the limitations of prior unlearning methods and demonstrate how interpretable, subspace-guided optimization can achieve robust, controllable model behavior.
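The abstract outlines two steps: selecting SAE features by their correlation with the forget topic to span relevant/irrelevant subspaces, and regularizing the unlearning update toward those subspaces. Below is a minimal illustrative sketch of one plausible instantiation; the names (`forget_acts`, `retain_acts`, `sae_decoder`, `k`, `lam`) and the exact scoring/regularization choices are assumptions, not the paper's definitive implementation.

```python
# Hypothetical sketch of SAE-guided subspace selection and a combined loss.
# Assumes a pretrained SAE whose decoder rows are per-feature directions in d_model space.
import torch

def build_subspaces(forget_acts: torch.Tensor,   # [n_tokens, n_features] SAE activations on forget-topic text
                    retain_acts: torch.Tensor,   # [n_tokens, n_features] SAE activations on general text
                    sae_decoder: torch.Tensor,   # [n_features, d_model] decoder directions
                    k: int = 32):
    # Score each SAE feature by how much more it activates on the forget topic.
    score = forget_acts.mean(0) - retain_acts.mean(0)
    top = score.topk(k).indices        # most forget-correlated features -> "relevant"
    bottom = (-score).topk(k).indices  # least forget-correlated features -> "irrelevant"

    # Orthonormalize the selected decoder directions to span each subspace.
    relevant = torch.linalg.qr(sae_decoder[top].T).Q      # [d_model, k]
    irrelevant = torch.linalg.qr(sae_decoder[bottom].T).Q  # [d_model, k]
    return relevant, irrelevant

def combined_loss(unlearn_loss: torch.Tensor,
                  delta_w: torch.Tensor,       # candidate weight update, [d_model, ...]
                  irrelevant: torch.Tensor,
                  lam: float = 1.0) -> torch.Tensor:
    # One possible regularizer: penalize the component of the update that
    # falls inside the irrelevant subspace, keeping edits on-topic.
    proj_irr = irrelevant @ (irrelevant.T @ delta_w)
    return unlearn_loss + lam * proj_irr.pow(2).sum()
```

This is only meant to make the subspace idea concrete: feature selection happens in SAE latent space, while the regularizer constrains where in parameter space the unlearning update is allowed to act.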
Submission Number: 17