MPSelectTune: Prompt-type Selection for Fine-tuning improves Concept Unlearning in LLMs

Published: 29 Sept 2025, Last Modified: 22 Oct 2025, Venue: NeurIPS 2025 - Reliable ML Workshop, License: CC BY 4.0
Keywords: LLM Unlearning, Prompt Selection
TL;DR: A method that selects among multiple prompt types when fine-tuning LLMs for concept unlearning
Abstract: LLMs can be conveniently adapted to a diverse set of tasks, e.g., prediction and question answering, using appropriate prompts with few-shot examples. Biased or harmful concepts present in pre-trained LLMs, e.g., gender or bio-weapons, can lead to unsafe or unethical responses to many such prompts. Removing such undesirable concepts robustly across different prompt types remains challenging, since existing unlearning methods typically ignore the impact of prompt variation. In this paper, we explore a novel adversarial approach that uses a joint prompt for main-task and concept-task prediction. We show that fine-tuning using the "worst prompt type" for concept prediction (the one with the highest concept accuracy) improves average unlearning performance over fine-tuning with a combination of all prompt types. Our proposed method, MPSelectTune, is a two-stage approach that minimizes the concept accuracy of the highest-accuracy prompt type, after fine-tuning with a novel multi-task loss over multiple prompt types. Relative to recent baselines, experimental results on four benchmarks show main-task accuracy improvements of 2-15% and reductions in worst-case concept accuracy of up to 17%.
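The selection step named in the abstract, picking the prompt type with the highest concept accuracy, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name, the prompt-type labels, and the accuracy values are hypothetical, and the sketch assumes per-prompt-type concept accuracies have already been measured on held-out data.

```python
from typing import Dict


def select_worst_prompt_type(concept_accuracy: Dict[str, float]) -> str:
    """Return the prompt type whose concept accuracy is highest.

    In the abstract's terminology this is the "worst prompt type": the one
    under which the undesirable concept is still most easily predicted.
    """
    if not concept_accuracy:
        raise ValueError("at least one prompt type is required")
    # max over the dict keys, ranked by their measured concept accuracy
    return max(concept_accuracy, key=concept_accuracy.get)


# Hypothetical measurements, for illustration only (not from the paper).
example_accuracies = {
    "zero_shot": 0.71,
    "few_shot": 0.83,
    "instruction": 0.78,
}
worst = select_worst_prompt_type(example_accuracies)
```

Under these made-up numbers, `worst` is the `few_shot` prompt type, which a second fine-tuning stage would then target when minimizing concept accuracy.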
Submission Number: 188