UnSTAR: Unlearning with Self-Taught Anti-Sample Reasoning for LLMs

Yash Sinha; Murari Mandal; Mohan Kankanhalli

16 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0

Keywords: Machine Unlearning, Large Language Models

Abstract: The key components of machine learning are data samples for training, models for learning patterns, and loss functions for optimizing accuracy. Analogously, unlearning can potentially be achieved through anti-data samples (or anti-samples), unlearning methods, and reversed loss functions. While prior research has explored unlearning methods and reversed loss functions, the potential of anti-samples remains largely untapped. In this paper, we introduce UnSTAR: $\underline{\text{Un}}$learning with $\underline{\text{S}}$elf-$\underline{\text{T}}$aught $\underline{\text{A}}$nti-Sample $\underline{\text{R}}$easoning for large language models (LLMs). Our contributions are threefold: first, we propose a novel concept of anti-sample-induced unlearning; second, we generate anti-samples by leveraging misleading rationales, which help reverse learned associations and accelerate the unlearning process; and third, we enable fine-grained targeted unlearning, allowing for the selective removal of specific associations without impacting related knowledge, which previous works do not achieve. Results demonstrate that anti-samples offer an efficient, targeted unlearning strategy for LLMs, opening new avenues for privacy-preserving machine learning and model modification.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 1105
