Keywords: LLM Unlearning, Prompt Selection
TL;DR: A method for selecting, from multiple prompt types, the prompt type used to fine-tune LLMs for concept unlearning
Abstract: LLMs can be conveniently adapted to a diverse set of tasks, e.g., prediction and question answering, using appropriate prompts with few-shot examples.
Biased or harmful concepts present in pre-trained LLMs, e.g., gender or bio-weapons, can lead to unsafe or unethical responses to many such prompts.
Removing such undesirable concepts robustly across different prompt types remains a challenging problem, since existing unlearning methods typically ignore the impact of prompt variation.
In this paper, we explore a novel adversarial approach that uses a joint prompt for main-task and concept-task prediction.
We show that fine-tuning on the ``worst'' prompt type for concept prediction (the one with the highest concept accuracy) improves average unlearning performance over fine-tuning on a combination of all prompt types.
Our proposed method, MPSelectTune, is a two-stage approach: after fine-tuning with a novel multi-task loss over multiple prompt types, it minimizes the concept accuracy of the highest-accuracy prompt type.
Experimental results on four benchmarks show 2--15\% main-task accuracy improvements over recent baselines while reducing worst-case concept accuracy by up to 17\%.
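
The abstract describes a two-stage procedure but gives no implementation detail. Below is a minimal, self-contained sketch of that two-stage idea using a toy PyTorch model. Everything here is an illustrative assumption rather than the paper's exact formulation: the ToyModel architecture stands in for an LLM, prompt types are simulated as fixed input projections, stage 1 uses an equally weighted multi-task cross-entropy, stage 2 uses gradient ascent on the concept loss with a main-task retention term, and the helper names (apply_prompt, concept_accuracy) are hypothetical.

```python
# Hedged sketch of a two-stage "worst prompt type" unlearning loop.
# All design choices below are assumptions, not MPSelectTune's actual method.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """Toy stand-in for an LLM: shared encoder, main-task and concept heads."""
    def __init__(self, dim=16, n_main=4, n_concept=2):
        super().__init__()
        self.encoder = nn.Linear(dim, 32)
        self.main_head = nn.Linear(32, n_main)
        self.concept_head = nn.Linear(32, n_concept)

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.main_head(h), self.concept_head(h)

# "Prompt types" simulated as fixed random input projections (an assumption).
prompt_types = [torch.randn(16, 16) for _ in range(3)]

def apply_prompt(x, p):
    return x @ p

# Synthetic data: inputs, main-task labels, concept labels.
x = torch.randn(256, 16)
y_main = torch.randint(0, 4, (256,))
y_concept = torch.randint(0, 2, (256,))

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

# Stage 1: joint fine-tuning with a multi-task loss over all prompt types.
for _ in range(200):
    loss = torch.tensor(0.0)
    for p in prompt_types:
        main_logits, concept_logits = model(apply_prompt(x, p))
        loss = loss + ce(main_logits, y_main) + ce(concept_logits, y_concept)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2a: select the worst prompt type (highest concept accuracy).
@torch.no_grad()
def concept_accuracy(p):
    _, concept_logits = model(apply_prompt(x, p))
    return (concept_logits.argmax(-1) == y_concept).float().mean().item()

worst = max(prompt_types, key=concept_accuracy)

# Stage 2b: minimize its concept accuracy, here via gradient ascent on the
# concept loss, keeping the main-task loss as a retention term (an assumption).
for _ in range(100):
    main_logits, concept_logits = model(apply_prompt(x, worst))
    loss = ce(main_logits, y_main) - ce(concept_logits, y_concept)
    opt.zero_grad(); loss.backward(); opt.step()

print("worst-prompt concept accuracy after stage 2:", concept_accuracy(worst))
```

The subtraction in stage 2b is one common way to trade unlearning against utility; the paper's actual loss and selection criterion may differ.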
Submission Number: 188