DR.PO: Dual Reference and Preference Optimization for Machine Unlearning in Large Language Model

18 Sept 2025 (modified: 27 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: dual-positive-preference, dual-reference-model, similarity-scores-retain
TL;DR: A dual-positive-preference and dual-reference-model optimization framework for LLM unlearning
Abstract: Most high-performing LLM unlearning methods rely on preference learning but share a critical flaw: insufficient or singular positive preferences cause post-unlearning models to generate meaningless, inconsistent, or single-category outputs. These outputs differ in probability distribution from those of a "fully unlearned" model, leading to suboptimal unlearning quality and privacy risks for the unlearned data. We propose DR.PO (Dual Reference and Preference Optimization), which adopts two positive preferences, an answer containing an incorrect fact and an answer indicating information deficiency, each paired with its own reference model. Instead of randomly sampling the retain set from the full dataset (as in existing methods), we match each forget sample to a corresponding retain sample via similarity scores, reducing repeated sampling of the retain set. Experiments show that our method not only achieves better unlearning quality and stronger privacy protection but also effectively preserves the model's original capabilities.
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 10210
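
To illustrate the retain-sample matching described in the abstract above, the following Python sketch pairs each forget sample with its most similar retain sample using cosine similarity over sentence embeddings. It is a minimal sketch under assumptions: the encoder (sentence-transformers' all-MiniLM-L6-v2), the function name match_retain_samples, and the choice of cosine similarity are illustrative and not details taken from the paper.

# Minimal sketch (assumption: cosine similarity over sentence embeddings;
# encoder, function name, and metric are illustrative, not from the paper).
from sentence_transformers import SentenceTransformer

def match_retain_samples(forget_texts, retain_texts, model_name="all-MiniLM-L6-v2"):
    """Pair each forget sample with its most similar retain sample."""
    encoder = SentenceTransformer(model_name)
    # Normalized embeddings make cosine similarity a plain dot product.
    forget_emb = encoder.encode(forget_texts, normalize_embeddings=True)
    retain_emb = encoder.encode(retain_texts, normalize_embeddings=True)
    sims = forget_emb @ retain_emb.T            # shape: (n_forget, n_retain)
    best = sims.argmax(axis=1)                  # closest retain sample per forget sample
    return [(forget_texts[i], retain_texts[int(j)]) for i, j in enumerate(best)]

# Example usage: pairs = match_retain_samples(forget_set, retain_set)

Pairing in this way optimizes each forget example against a semantically related retain example rather than a randomly drawn one, which is the motivation the abstract gives for reducing repeated sampling of the retain set.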