Abstract: LLMs are widely used for prediction and question-answering tasks via in-context learning. Biased or harmful concepts in pre-trained LLMs can lead to unsafe or unethical responses, and concept unlearning can help ensure that responses remain safe and compliant. Existing approaches to concept unlearning from LLMs do not consider the effect of multiple prompts on unlearning performance. In this paper, we explore a novel adversarial approach that uses a joint prompt for the main task and concept prediction. We ask: does fine-tuning on the worst prompt for concept prediction improve the average unlearning performance across prompts? To answer this, we propose a two-stage approach, called MPSelectTune, which minimizes the concept accuracy of the highest-accuracy prompt after fine-tuning with a novel multi-task loss over multiple prompts. Experimental results on four benchmarks show $2 - 15\%$ main task accuracy improvements over recent baselines, while reducing the worst-case concept accuracy by up to $17\%$.
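The two-stage idea can be read as a min-max objective: fine-tune on a multi-prompt main-task loss while suppressing the concept accuracy of the worst (highest-accuracy) prompt. The formalization below is a sketch inferred from the abstract only; the symbols $\theta$ (model parameters), $p_1,\dots,p_K$ (prompt set), $\mathcal{L}_{\text{task}}$, $\mathrm{Acc}_{\text{concept}}$, and the trade-off weight $\lambda$ are illustrative and not taken from the paper.

\[
\theta^{\star} \;=\; \arg\min_{\theta}\;
\underbrace{\frac{1}{K}\sum_{i=1}^{K} \mathcal{L}_{\text{task}}(\theta;\, p_i)}_{\text{multi-prompt main-task loss}}
\;+\;
\lambda\,
\underbrace{\max_{i \in \{1,\dots,K\}} \mathrm{Acc}_{\text{concept}}(\theta;\, p_i)}_{\text{worst-case concept accuracy}}
\]

Under this reading, the first stage selects the prompt attaining the inner maximum (the highest concept accuracy), and the second stage fine-tunes $\theta$ against that prompt jointly with the main-task loss.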
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: LLM Unlearning, Instruction Fine-tuning
Contribution Types: NLP engineering experiment
Languages Studied: English
Keywords: LLM Unlearning, Instruction Fine-tuning
Submission Number: 2932