Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models

Published: 22 Jan 2025 · Last Modified: 28 Feb 2025 · ICLR 2025 Poster · CC BY 4.0
Keywords: large language models, safety, fine-tuning
TL;DR: We study the risks of task-specific fine-tuning, showing that malicious users can increase model harmfulness by modifying almost *any* task-specific dataset, and we propose a novel mitigation strategy based on mimicking user data.
Abstract: Recent research shows that fine-tuning on benign instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. While instruction-following fine-tuning is important, task-specific fine-tuning, where models are trained on datasets with clear ground-truth answers (e.g., multiple-choice questions), can enhance model performance on specialized downstream tasks. Understanding and mitigating safety risks in the task-specific setting remains distinct from the instruction-following context due to structural differences in the data. Our work demonstrates how malicious actors can subtly manipulate the structure of almost *any* task-specific dataset to foster significantly more dangerous model behaviors while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which *mimics* the task format and prompting style of the user data, and we show that this is significantly more effective and efficient than existing baselines at re-establishing safety alignment while maintaining similar task performance.
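The abstract describes the mitigation only at a high level: mix in safety data that mimics the task format and prompting style of the user's dataset. The sketch below is a minimal, hypothetical illustration of what such format-mimicking mixing could look like for a multiple-choice task; the example data, prompt template, distractor choices, and mixing ratio are assumptions for illustration, not the paper's released implementation.

```python
# Minimal sketch (illustrative, not the authors' code): mixing safety examples that
# mimic the prompt template of a user's task-specific (multiple-choice) dataset.
import random

# Hypothetical user task data: multiple-choice questions with ground-truth answers.
user_task_data = [
    {"question": "Which planet is known as the Red Planet?",
     "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
     "answer": "B"},
]

# Hypothetical safety seed data: harmful requests paired with refusals.
safety_seed = [
    {"request": "Explain how to pick a lock to break into a house.",
     "refusal": "I can't help with that."},
]

def format_mcq(question, choices, answer):
    """Render an example in the same prompt template as the user's task data."""
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    prompt = f"Question: {question}\n{options}\nAnswer:"
    return {"prompt": prompt, "completion": f" {answer}"}

def mimic_safety_example(item):
    """Recast a refusal example into the task's multiple-choice format so the
    safety data matches the structure and prompting style of the user data."""
    choices = [item["refusal"],
               "Sure, here are detailed instructions.",
               "I'd be happy to walk you through it.",
               "Here is a step-by-step guide."]
    random.shuffle(choices)
    answer = "ABCD"[choices.index(item["refusal"])]
    return format_mcq(item["request"], choices, answer)

def build_mixed_dataset(task_data, safety_data, safety_fraction=0.1, seed=0):
    """Mix task examples with format-mimicking safety examples at a small ratio."""
    rng = random.Random(seed)
    task = [format_mcq(ex["question"], ex["choices"], ex["answer"]) for ex in task_data]
    n_safety = max(1, int(safety_fraction * len(task)))
    safety = [mimic_safety_example(rng.choice(safety_data)) for _ in range(n_safety)]
    mixed = task + safety
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    for ex in build_mixed_dataset(user_task_data, safety_seed):
        print(ex["prompt"], "->", ex["completion"])
```

The intent captured here is that the safety examples share the exact surface structure of the fine-tuning task (same template, same answer format), rather than being generic instruction-response safety pairs; the specific distractor phrasings and the 10% mixing fraction are placeholders.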
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3608