Orthogonal Gradient Projection for Continual LLM Unlearning

Published: 05 Mar 2026, Last Modified: 05 Mar 2026 | ICLR 2026 Workshop RSI Short Paper | CC BY 4.0
Keywords: Large language model, unlearning, continual unlearning
TL;DR: We propose Orthogonal Negative Preference Optimization (ONPO), a lightweight plug-in for preference-based unlearning.
Abstract: Machine unlearning aims to remove targeted information from large language models (LLMs) without full retraining, but existing methods often degrade utility and become unstable in continual settings where deletion requests arrive sequentially. We study continual LLM unlearning through the lens of gradient interference: successive forgetting updates can conflict with earlier unlearning steps, leading to cascading utility loss or regression on previously forgotten behavior. We propose Orthogonal Negative Preference Optimization (ONPO), a lightweight plug-in for preference-based unlearning that projects each step’s update onto the orthogonal complement of a low-dimensional subspace spanned by cached gradients from previous unlearning requests. This orthogonalization conservatively limits first-order changes to prior unlearning objectives, mitigating over-unlearning drift. On the TOFU continual unlearning setting, ONPO improves the trade-off between Forget Quality and Model Utility over gradient ascent and NPO baselines.
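The projection step the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, array shapes, and the use of a QR decomposition to orthonormalize the cached gradients are all assumptions for the sketch.

```python
import numpy as np

def orthogonal_project(grad, cached_grads):
    """Project `grad` onto the orthogonal complement of the subspace
    spanned by cached gradients from earlier unlearning requests.

    grad:         (d,) current unlearning update direction (hypothetical shape)
    cached_grads: (k, d) rows are gradients cached from previous requests
    """
    # Orthonormal basis Q for the cached-gradient subspace via reduced QR.
    Q, _ = np.linalg.qr(cached_grads.T)  # Q has shape (d, k)
    # Subtract the component of grad lying inside that subspace, so the
    # update has zero first-order effect along the cached directions.
    return grad - Q @ (Q.T @ grad)

rng = np.random.default_rng(0)
cached = rng.normal(size=(3, 16))   # 3 earlier unlearning gradients
g = rng.normal(size=16)             # current step's gradient
g_perp = orthogonal_project(g, cached)
# g_perp is (numerically) orthogonal to every cached gradient
print(bool(np.abs(cached @ g_perp).max() < 1e-10))
```

Under this sketch, the projected update leaves each earlier unlearning objective unchanged to first order, which is the conservative behavior the abstract attributes to ONPO.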
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 133