Abstract: LLMs are widely used for prediction and question-answering tasks via in-context learning. Biased or harmful concepts in pre-trained LLMs can lead to unsafe or unethical responses, and concept unlearning can help ensure that responses remain safe and compliant. Existing approaches to concept unlearning from LLMs do not consider the effect of multiple prompts on unlearning performance. In this paper, we explore a novel adversarial approach that uses a joint prompt for the main task and concept prediction. We ask: does fine-tuning on the worst prompt for concept prediction improve the average unlearning performance across prompts? To answer this, we propose a two-stage approach, called MPSelectTune, which minimizes the concept accuracy of the highest-accuracy prompt after fine-tuning with a novel multi-task loss over multiple prompts. Experimental results on four benchmarks show $2 - 15\%$ main task accuracy improvements over recent baselines, while reducing the worst-case concept accuracy by up to $17\%$.
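The two-stage idea can be read as a min-max objective: fine-tune on a multi-prompt main-task loss while suppressing the concept accuracy of the worst (highest-accuracy) prompt. The formalization below is a sketch inferred from the abstract only; the symbols $\theta$ (model parameters), $p_1,\dots,p_K$ (prompt set), $\mathcal{L}_{\text{task}}$, $\mathrm{Acc}_{\text{concept}}$, and the trade-off weight $\lambda$ are illustrative and not taken from the paper.

\[
\theta^{\star} \;=\; \arg\min_{\theta}\;
\underbrace{\frac{1}{K}\sum_{i=1}^{K} \mathcal{L}_{\text{task}}(\theta;\, p_i)}_{\text{multi-prompt main-task loss}}
\;+\;
\lambda\,
\underbrace{\max_{i \in \{1,\dots,K\}} \mathrm{Acc}_{\text{concept}}(\theta;\, p_i)}_{\text{worst-case concept accuracy}}
\]

Under this reading, the first stage selects the prompt attaining the inner maximum (the highest concept accuracy), and the second stage fine-tunes $\theta$ against that prompt jointly with the main-task loss.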
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: LLM Unlearning, Instruction Fine-tuning
Contribution Types: NLP engineering experiment
Languages Studied: English
Keywords: LLM Unlearning, Instruction Fine-tuning
Submission Number: 2932