Keywords: Machine Unlearning, AI Safety, Large Language Model
Abstract: Current machine unlearning methods for large language models (LLMs) struggle with a persistent trade-off between forgetting effectiveness and overall model utility. We attribute this trade-off to two empirical observations: (i) layer-wise logit accumulation toward a target token is driven more by the output token itself than by the input query, and (ii) hidden states that produce the same token vary only along directions orthogonal to the unembedding row $u_k$, creating what we term the same-output plane. Because a forget input shares its logit pathway with all retained contexts generating the same token, simply suppressing the forget logit inevitably compromises performance on those contexts. To overcome this, we propose **Break the Output Geometry (BOG)**. This approach preserves the same-output plane and specifically displaces the forget input away from it along the single direction $u_k$, using a margin derived from the model’s cross-target statistics. Empirically, BOG demonstrates a superior forget–retain trade-off on the TOFU benchmark.
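The geometry described in the abstract can be illustrated with a small toy sketch. It is not the BOG training objective. It only shows that hidden states differing along directions orthogonal to $u_k$ share the same logit for token $k$, and that moving one state along $u_k$ changes that logit by a chosen margin. The vectors, the helper names (`dot`, `project_out`, `displace_along`), and the margin value are all hypothetical, and the margin here is a hand-picked number rather than one derived from cross-target statistics.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_out(v, u_k):
    """Remove the component of v along u_k, leaving a vector orthogonal to u_k."""
    coef = dot(v, u_k) / dot(u_k, u_k)
    return [vi - coef * ui for vi, ui in zip(v, u_k)]

def displace_along(h, u_k, margin):
    """Shift h along u_k only, so that the logit u_k . h drops by `margin`."""
    step = margin / dot(u_k, u_k)
    return [hi - step * ui for hi, ui in zip(h, u_k)]

# Hypothetical unembedding row for token k and a hypothetical forget hidden state.
u_k = [1.0, 2.0, 2.0]
h_forget = [1.0, 1.0, 1.0]

# A retained hidden state that differs from h_forget only orthogonally to u_k,
# i.e. it lies on the same-output plane for token k.
offset = project_out([0.5, -1.0, 3.0], u_k)
h_retain = [a + b for a, b in zip(h_forget, offset)]

logit_forget = dot(u_k, h_forget)
logit_retain = dot(u_k, h_retain)
same_logit = math.isclose(logit_forget, logit_retain)

# Move only the forget state off the plane along u_k (margin is an assumption).
margin = 4.0
h_moved = displace_along(h_forget, u_k, margin)
logit_moved = dot(u_k, h_moved)

print("forget and retain share the token-k logit:", same_logit)
print("logit drop after displacement:", logit_forget - logit_moved)
```

In this toy setting the retained state is left untouched, so its logit for token $k$ is unchanged, while the forget state's logit falls by the margin. How BOG selects the margin and enforces the displacement during training is not specified in the abstract and is not modeled here.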
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 66