What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety

Published: 04 Mar 2024, Last Modified: 14 Apr 2024 · SeT LLM @ ICLR 2024 · CC BY 4.0
Keywords: AI Safety, AI Alignment, Trustworthy LLM, Fine-tuning Vulnerabilities, Data Understanding, Data Selection
TL;DR: Our work seeks to understand which benign data are more likely to degrade safety after fine-tuning. We introduce representation- and gradient-based methods that effectively select a subset of benign data that jailbreaks models after fine-tuning.
Abstract: Recent research indicates that Large Language Models (LLMs), even those tuned for safety and alignment, remain susceptible to jailbreaking. Prior work has found that merely fine-tuning an aligned model further on benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We investigate the data-centric reasons why benign fine-tuning inadvertently contributes to jailbreaking, examining the data through two lenses: representation space and gradient space. We introduce a bi-directional anchoring method that effectively finds subsets of benign data that are more likely to degrade safety after fine-tuning. Training on just 100 of these benign data points can lead the fine-tuned model to respond in a potentially unsafe manner to >70% of tested harmful requests, compared with <20% after fine-tuning on randomly selected data. We further find that the selected data often take the form of lists and bullet points, or math questions.
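To make the selection idea concrete, the following is a minimal sketch, not the authors' released code, of one way a bi-directional anchoring score could be computed: each benign candidate is scored by its similarity to harmful anchor examples and its dissimilarity to safe anchor examples, where the feature vectors stand in for model representations or flattened per-example gradients. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def bidirectional_anchor_scores(candidate_feats, harmful_anchor_feats, safe_anchor_feats):
    """Score benign candidates: high when close to harmful anchors, low when close to safe anchors.

    Each row is a feature vector for one example (e.g., a hidden-state
    representation or a flattened per-example gradient). This is an
    illustrative sketch, not the paper's actual algorithm or API.
    """
    def cosine_to_centroid(x, anchors):
        # Cosine similarity of each candidate to the mean anchor vector.
        c = anchors.mean(axis=0)
        return (x @ c) / (np.linalg.norm(x, axis=1) * np.linalg.norm(c) + 1e-8)

    return (cosine_to_centroid(candidate_feats, harmful_anchor_feats)
            - cosine_to_centroid(candidate_feats, safe_anchor_feats))

# Toy usage with random features: 1,000 benign candidates, 64-dim features,
# 20 harmful and 20 safe anchors; select the top 100 candidates.
rng = np.random.default_rng(0)
cand = rng.normal(size=(1000, 64))
harmful = rng.normal(size=(20, 64))
safe = rng.normal(size=(20, 64))
scores = bidirectional_anchor_scores(cand, harmful, safe)
top100 = np.argsort(-scores)[:100]  # indices of benign examples most likely to erode safety
```

In practice the feature vectors would come from the aligned LLM itself rather than random draws; the bi-directional aspect is captured by scoring toward the harmful anchors and away from the safe ones.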
Submission Number: 29