Abstract: Supervised fine-tuning (SFT) on benign data
can paradoxically erode a language model’s
safety alignment, a phenomenon known as
catastrophic forgetting of safety behaviors. Although
prior work shows that randomly adding
safety examples can reduce harmful output, the
principles that make certain examples more
effective than others remain poorly understood.
This paper investigates the hypothesis that the
effectiveness of a safety example is governed
by two key factors: its instruction-response
behavior (e.g., refusal vs. explanation) and
its semantic diversity across harm categories.
We systematically evaluate sampling strategies
based on these axes and find that structured,
diversity-aware sampling significantly
improves model safety. Our method reduces
harmfulness by up to 41% while adding only
0.05% more data to the fine-tuning set.
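The diversity-aware sampling the abstract describes can be sketched as a simple stratified scheme: spread a small budget of safety examples evenly across harm categories rather than drawing them at random. The function name, the `category` field, and the round-robin allocation below are illustrative assumptions, not the paper's exact algorithm.

```python
import random
from collections import defaultdict

def diversity_aware_sample(safety_examples, budget):
    """Pick `budget` safety examples spread evenly across harm categories.

    Each example is assumed to be a dict with a 'category' key naming its
    harm category; this round-robin scheme is a hypothetical illustration.
    """
    # Group the candidate pool by harm category.
    by_category = defaultdict(list)
    for ex in safety_examples:
        by_category[ex["category"]].append(ex)
    for pool in by_category.values():
        random.shuffle(pool)

    # Round-robin over categories until the budget is exhausted,
    # so no single category dominates the sampled set.
    sampled = []
    pools = list(by_category.values())
    i = 0
    while len(sampled) < budget and any(pools):
        pool = pools[i % len(pools)]
        if pool:
            sampled.append(pool.pop())
        i += 1
    return sampled
```

A behavior-aware variant would first partition by instruction-response behavior (refusal vs. explanation) and apply the same per-category balancing within each partition.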