Contrastive Negative Preference Optimization for Machine Unlearning in LLMs

ICLR 2026 Conference Submission10628 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Unlearning, Large language model, Preference optimization, Noise Contrastive Estimation
Abstract: During large-scale training on extensive corpora, language models inevitably memorize unwanted data (e.g., private or copyrighted content). While numerous unlearning methods have been proposed—including gradient ascent (GA)-based approaches and preference-based optimization—existing methods either fail to effectively erase target data or fail to strike a reasonable balance between unlearning efficacy and model utility; a grounded optimization framework is lacking. In this work, we present Contrastive Negative Preference Optimization (CNPO), a novel algorithm that leverages inter-sample relationships within datasets to effectively and adaptively remove target data while maintaining model performance on the retained set. To separate the retained data from the target data, we follow the idea of Noise Contrastive Estimation (NCE) and derive the final loss function within the framework of preference optimization. Through an asymptotic analysis of CNPO, we theoretically establish its connections with GA and NPO. Furthermore, to evaluate the response usability and privacy-protection capability of CNPO, we introduce a personally identifiable information (PII) dataset and develop a suite of metrics for assessing generated text. Overall, theoretical analysis and comprehensive evaluation on three benchmarks demonstrate CNPO's stable unlearning behavior and the best balance between forgetting and utility preservation among existing methods.
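The abstract does not state CNPO's exact loss, but it describes deriving an NCE-style contrastive objective within the preference-optimization framework and relates it to NPO. As a hedged illustration only, the sketch below implements the published NPO per-example loss, (2/β)·log(1 + (π_θ/π_ref)^β), and a *hypothetical* contrastive variant in which the forget sample's log-ratio is normalized against a set of retain samples via a log-sum-exp, in the spirit of NCE. The function names and the contrastive form are assumptions for illustration, not the paper's actual formulation.

```python
import math

def npo_style_loss(logp_model: float, logp_ref: float, beta: float = 0.1) -> float:
    """NPO per-example loss: (2/beta) * log(1 + (pi_theta / pi_ref)^beta),
    computed from log-probabilities for numerical stability."""
    log_ratio = logp_model - logp_ref
    return (2.0 / beta) * math.log1p(math.exp(beta * log_ratio))

def contrastive_npo_sketch(logp_forget: float, logp_ref_forget: float,
                           logp_retain: list[float], logp_ref_retain: list[float],
                           beta: float = 0.1) -> float:
    """Hypothetical NCE-flavored variant (an assumption, not the paper's loss):
    the forget sample's scaled log-ratio is contrasted against retain samples,
    with a log-sum-exp over retain log-ratios acting as the NCE normalizer."""
    forget_score = beta * (logp_forget - logp_ref_forget)
    retain_scores = [beta * (m - r) for m, r in zip(logp_retain, logp_ref_retain)]
    normalizer = math.log(sum(math.exp(s) for s in retain_scores))
    return (2.0 / beta) * math.log1p(math.exp(forget_score - normalizer))
```

Note the intended behavior: as the model's log-probability on the forget sample drops below the reference model's (log-ratio → −∞), the NPO-style loss decays toward 0, avoiding the unbounded divergence of plain gradient ascent.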
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10628