Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Large Language Model, Machine Unlearning, Preference Optimization
Abstract: This work studies the problem of large language model (LLM) unlearning, aiming to remove unwanted data influences (e.g., copyrighted or harmful content) while preserving model utility. Despite the increasing demand for unlearning, a technically grounded optimization framework is lacking. Gradient ascent (GA)-type methods, though widely used, are suboptimal as they reverse the learning process without controlling optimization divergence (i.e., deviation from the pre-trained state), leading to risks of model collapse. Negative preference optimization (NPO) has been proposed to address this issue and is considered one of the state-of-the-art LLM unlearning approaches. In this work, we revisit NPO and identify another critical issue: reference model bias. This bias arises from using the reference model (i.e., the model prior to unlearning) to assess unlearning success, which can create a misleading impression of the true data-wise unlearning effectiveness. Specifically, it can cause (a) uneven allocation of optimization power across forget data of varying difficulty, and (b) ineffective gradient weight smoothing during the early stages of unlearning optimization. To overcome these challenges, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that simplicity—removing the reliance on a reference model (through the lens of simple preference optimization)—benefits unlearning. We provide deeper insights into SimNPO's advantages, including an analysis based on mixtures of Markov chains. Extensive experiments further validate its efficacy on benchmarks like TOFU, MUSE, and WMDP.
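To make the reference-model contrast concrete, the following is a minimal sketch of NPO-style and reference-free SimNPO-style losses on a single forget sequence. The function names, default `beta`/`gamma` values, and the scalar (sequence-level log-probability) inputs are illustrative assumptions, not the paper's implementation; the key structural difference shown is that the NPO-style loss weights unlearning by the log-ratio against a reference model, while the SimNPO-style loss drops the reference and length-normalizes the model's own log-probability instead.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def npo_style_loss(logp_theta, logp_ref, beta=0.1):
    # NPO-style: -(2/beta) * log sigmoid(-beta * log(pi_theta / pi_ref)).
    # Depends on the reference model's log-probability logp_ref,
    # which is the source of the reference-model bias discussed above.
    return -(2.0 / beta) * log_sigmoid(-beta * (logp_theta - logp_ref))

def simnpo_style_loss(logp_theta, seq_len, beta=2.5, gamma=0.0):
    # SimNPO-style: reference-free; the reward is the length-normalized
    # log-probability (beta / |y|) * log pi_theta(y|x), shifted by a margin gamma.
    return -(2.0 / beta) * log_sigmoid(-(beta / seq_len) * logp_theta - gamma)
```

In both variants the loss shrinks as the model's probability of the forget sequence drops, but only the NPO-style loss measures that progress relative to the pre-unlearning reference model; the length normalization in the SimNPO-style loss is what redistributes optimization power across forget samples of different lengths.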
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 18471