Learn to be Unlearned: Optimizing Language Models for Unlearning via Clustered Gradient Routing

Published: 02 Mar 2026, Last Modified: 06 Mar 2026 · ICLR 2026 Trustworthy AI · CC BY 4.0
Keywords: LLM, Unlearning, Privacy
Abstract: A central challenge for machine unlearning in Large Language Models (LLMs) is parameter entanglement: information about specific training examples is spread across many parameters, and individual parameters typically encode information about many different samples. As a result, unlearning often hurts general performance on samples that should be retained. We propose a simple training mechanism that reduces the entanglement of information from different training samples in the model's parameters by combining unsupervised semantic clustering with gradient routing. First, we cluster the training data into semantic groups. Then, we split the LLM's parameters into disjoint blocks and train a small cluster-conditioned router that decides which block should receive the main gradient update for each cluster. The forward pass is unchanged and uses all parameters, but the backward pass becomes sparse through per-block gradient masking. This encourages different data clusters to update mostly different parameter blocks. To unlearn a sample, we freeze the router and apply an existing unlearning method only to the block that contains information about that sample. This yields more targeted updates and less degradation on unrelated retain samples. On TOFU and MUSE, our approach consistently improves post-unlearning model utility compared to existing unlearning baselines, while keeping forgetting performance and the final privacy risk for the unlearned data at a comparable level. Overall, our work contributes towards a more privacy-friendly deployment of LLMs.
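The core mechanism described in the abstract, a dense forward pass with a sparse, cluster-routed backward pass, can be illustrated with a toy sketch. The function and variable names below are hypothetical, and the example reduces the idea to scalar gradients and a fixed cluster-to-block assignment; it is not the authors' implementation.

```python
# Minimal sketch of per-block gradient masking for gradient routing.
# Assumption: each parameter belongs to exactly one disjoint block, and a
# (hypothetical) router has already mapped the current data cluster to one
# block. All names here are illustrative, not from the paper's code.

def route_gradients(grads, block_of_param, routed_block):
    """Zero out gradients for every parameter outside the routed block.

    The forward pass would still use all parameters; only the backward
    update is restricted, so the routed cluster's information is
    concentrated in one block.
    """
    return {
        name: g if block_of_param[name] == routed_block else 0.0
        for name, g in grads.items()
    }

# Toy model: four parameters split into two disjoint blocks.
block_of_param = {"w1": 0, "w2": 0, "w3": 1, "w4": 1}
grads = {"w1": 0.5, "w2": -0.2, "w3": 0.8, "w4": 0.1}

# Suppose the router assigns the current cluster to block 1:
# only w3 and w4 receive a gradient update.
masked = route_gradients(grads, block_of_param, routed_block=1)
print(masked)
```

Unlearning a sample from that cluster would then touch only the parameters of block 1, which is why the updates stay targeted and retain-set performance degrades less.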
Submission Number: 130