TL;DR: A new alignment method that improves safety in LLMs by identifying and upweighting vulnerable alignment data.
Abstract: Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representations on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks and exhibit lower robustness than other subsets. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which calculates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the currently underperforming group and then applies group-dependent adversarial perturbations to the data during training, promoting balanced learning across groups. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.
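To make the training procedure described in the abstract concrete, below is a minimal, hypothetical sketch of a single VAA-style update under a Group DRO objective. It assumes the alignment data has already been partitioned into "vulnerable" and "invulnerable" groups, that the adversarial sampler is implemented with exponentiated-gradient weights over groups, that the group-dependent perturbation is a single FGSM-style step on the input embeddings, and that the model exposes a HuggingFace-style causal-LM interface. The names `vaa_step`, `q`, `eta`, and `eps` are illustrative; they are not taken from the paper or the released code.

```python
# Minimal sketch of one VAA-style update step under a Group DRO objective.
# Assumptions (not from the released code): two pre-computed groups, an
# exponentiated-gradient sampler over groups, an FGSM-style embedding
# perturbation, and a HuggingFace-style causal LM (inputs_embeds, labels -> .loss).
import math
import torch


def vaa_step(model, batches_by_group, q, eta=0.1,
             eps={"vulnerable": 1e-3, "invulnerable": 0.0}):
    # 1) Adversarial sampler: draw a group with probability proportional to its
    #    current weight, so the underperforming group is sampled more often.
    groups = list(q)
    g = groups[torch.multinomial(torch.tensor([q[k] for k in groups]), 1).item()]
    input_ids, labels = batches_by_group[g]

    # 2) Group-dependent adversarial perturbation in embedding space.
    embeds = model.get_input_embeddings()(input_ids)
    delta = torch.zeros_like(embeds, requires_grad=True)
    loss = model(inputs_embeds=embeds + delta, labels=labels).loss
    if eps[g] > 0:
        (grad,) = torch.autograd.grad(loss, delta)
        # Re-evaluate the loss on perturbed embeddings (stronger perturbation
        # for the vulnerable group; none for the invulnerable one here).
        loss = model(inputs_embeds=embeds + eps[g] * grad.sign(), labels=labels).loss

    # 3) Group DRO bookkeeping: exponentiated-gradient update of the group
    #    weights, so higher-loss groups receive more attention next step.
    q[g] *= math.exp(eta * loss.item())
    total = sum(q.values())
    for k in q:
        q[k] /= total

    loss.backward()  # caller then runs optimizer.step() / optimizer.zero_grad()
    return loss.item(), g
```

In an actual run, per-example vulnerability would first be computed and used to form the groups, and the perturbation strengths and step size would be tuned; see the linked repository for the authors' implementation.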
Lay Summary: Large language models (LLMs) are powerful tools that can be adapted to many tasks. However, when people fine-tune these models, especially with harmful or unsafe data, the model can "forget" how to behave safely. This process, called harmful fine-tuning, is a growing concern as open-source models and fine-tuning services become more widely available.
In our research, we found that some parts of the original safety training data are easier for the model to forget than others. These “easy-to-forget” examples are often the most important for teaching the model safe behavior. Yet, most current methods treat all data equally, which can leave models more vulnerable to harmful fine-tuning.
We propose a new method called Vulnerability-Aware Alignment (VAA). It identifies which data is most likely to be forgotten and gives it more attention during training. By grouping the data and adjusting how the model learns from each group, our method helps the model stay safer, even when fine-tuned later.
This work offers a more targeted way to improve model safety and helps build more trustworthy AI systems.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/ChanLiang/VAA
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, robustness, safety alignment, harmful fine-tuning, forgetting, group distributionally robust optimization, balanced learning, AI safety
Submission Number: 14446