Keywords: LLM Unlearning, Prompt Selection
TL;DR: A method for selecting, from multiple prompt types, the prompt type used to fine-tune LLMs for concept unlearning
Abstract: LLMs can be conveniently adapted to a diverse set of tasks, e.g., prediction and question answering, using appropriate prompts with few-shot examples.
Biased or harmful concepts present in pre-trained LLMs, e.g., gender or bio-weapons, can lead to unsafe or unethical responses to many such prompts.
Removing such undesirable concepts robustly across different prompt types remains a challenging problem, since existing unlearning methods typically ignore the impact of prompt variation.
In this paper, we explore a novel adversarial approach that uses a joint prompt for main-task and concept-task prediction.
We show that fine-tuning on the ``worst'' prompt type for concept prediction (the one with the highest concept accuracy) improves average unlearning performance over fine-tuning on a combination of all prompt types.
Our proposed method, MPSelectTune, is a two-stage approach: after fine-tuning with a novel multi-task loss over multiple prompt types, it minimizes the concept accuracy of the highest-accuracy prompt type.
Experimental results on four benchmarks show 2--15\% main-task accuracy improvements over recent baselines while reducing worst-case concept accuracy by up to 17\%.
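
The abstract describes a two-stage procedure but gives no implementation detail. Below is a minimal, self-contained sketch of that two-stage idea using a toy PyTorch model. Everything here is an illustrative assumption rather than the paper's exact formulation: the ToyModel architecture stands in for an LLM, prompt types are simulated as fixed input projections, stage 1 uses an equally weighted multi-task cross-entropy, stage 2 uses gradient ascent on the concept loss with a main-task retention term, and the helper names (apply_prompt, concept_accuracy) are hypothetical.

```python
# Hedged sketch of a two-stage "worst prompt type" unlearning loop.
# All design choices below are assumptions, not MPSelectTune's actual method.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """Toy stand-in for an LLM: shared encoder, main-task and concept heads."""
    def __init__(self, dim=16, n_main=4, n_concept=2):
        super().__init__()
        self.encoder = nn.Linear(dim, 32)
        self.main_head = nn.Linear(32, n_main)
        self.concept_head = nn.Linear(32, n_concept)

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.main_head(h), self.concept_head(h)

# "Prompt types" simulated as fixed random input projections (an assumption).
prompt_types = [torch.randn(16, 16) for _ in range(3)]

def apply_prompt(x, p):
    return x @ p

# Synthetic data: inputs, main-task labels, concept labels.
x = torch.randn(256, 16)
y_main = torch.randint(0, 4, (256,))
y_concept = torch.randint(0, 2, (256,))

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

# Stage 1: joint fine-tuning with a multi-task loss over all prompt types.
for _ in range(200):
    loss = torch.tensor(0.0)
    for p in prompt_types:
        main_logits, concept_logits = model(apply_prompt(x, p))
        loss = loss + ce(main_logits, y_main) + ce(concept_logits, y_concept)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2a: select the worst prompt type (highest concept accuracy).
@torch.no_grad()
def concept_accuracy(p):
    _, concept_logits = model(apply_prompt(x, p))
    return (concept_logits.argmax(-1) == y_concept).float().mean().item()

worst = max(prompt_types, key=concept_accuracy)

# Stage 2b: minimize its concept accuracy, here via gradient ascent on the
# concept loss, keeping the main-task loss as a retention term (an assumption).
for _ in range(100):
    main_logits, concept_logits = model(apply_prompt(x, worst))
    loss = ce(main_logits, y_main) - ce(concept_logits, y_concept)
    opt.zero_grad(); loss.backward(); opt.step()

print("worst-prompt concept accuracy after stage 2:", concept_accuracy(worst))
```

The subtraction in stage 2b is one common way to trade unlearning against utility; the paper's actual loss and selection criterion may differ.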
Submission Number: 188