UnSTAR: Unlearning with Self-Taught Anti-Sample Reasoning for LLMs

Published: 25 Jun 2025, Last Modified: 25 Jun 2025
Accepted by TMLR
License: CC BY 4.0
Abstract: The key components of machine learning are data samples for training, a model for learning patterns, and a loss function for optimizing accuracy. Analogously, unlearning can potentially be achieved through anti-data-samples (or anti-samples), an unlearning method, and a reversed loss function. While prior research has explored unlearning methods and reversed loss functions, the potential of anti-samples remains largely untapped. Although token-based anti-samples have been introduced previously (Eldan & Russinovich, 2023), the use of reasoning-driven anti-samples, constructed with falsified answers and misleading rationales, remains unexplored. In this paper, we introduce UnSTAR: Unlearning with Self-Taught Anti-Sample Reasoning for large language models (LLMs). Our contributions are threefold: first, we propose the novel concept of reasoning-based anti-sample-induced unlearning; second, we generate anti-samples by leveraging misleading rationales, which help reverse learned associations and accelerate the unlearning process; and third, we enable fine-grained targeted unlearning, allowing the selective removal of specific associations without impacting related knowledge, something not achievable by previous works. Results demonstrate that anti-samples offer an efficient, targeted unlearning strategy for LLMs, opening new avenues for privacy-preserving machine learning and model modification.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Addressed the changes requested by the Action Editor and prepared the camera-ready version.
Code: https://github.com/MachineUnlearn/UnStar
Assigned Action Editor: ~Eleni_Triantafillou1
Submission Number: 4414