How to Fine-Tune Safely on a Budget: Model Adaptation Using Minimal Resources

Published: 04 Nov 2025, Last Modified: 15 Apr 2026 · EMNLP 2025 · arXiv.org perpetual, non-exclusive license
Abstract: Supervised fine-tuning (SFT) on benign data can paradoxically erode a language model’s safety alignment, a phenomenon known as catastrophic forgetting of safety behaviors. Although prior work shows that randomly adding safety examples can reduce harmful output, the principles that make certain examples more effective than others remain poorly understood. This paper investigates the hypothesis that the effectiveness of a safety example is governed by two key factors: its instruction-response behavior (e.g., refusal vs. explanation) and its semantic diversity across harm categories. We systematically evaluate sampling strategies based on these axes and find that structured, diversity-aware sampling significantly improves model safety. Our method reduces harmfulness by up to 41% while adding only 0.05% more data to the fine-tuning set.
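The diversity axis described above could be implemented in several ways; one minimal sketch is a round-robin stratified sampler that spreads a small safety-example budget across harm categories. This is an illustrative assumption, not the paper's actual method: the `category` field, the `diversity_aware_sample` helper, and the round-robin policy are all hypothetical.

```python
import random
from collections import defaultdict

def diversity_aware_sample(safety_examples, budget):
    """Pick up to `budget` safety examples, cycling over harm
    categories so the sample covers as many categories as possible
    (a simple stratified-sampling sketch, not the paper's method)."""
    by_category = defaultdict(list)
    for ex in safety_examples:
        by_category[ex["category"]].append(ex)
    for pool in by_category.values():
        random.shuffle(pool)  # randomize within each category
    sampled = []
    pools = list(by_category.values())
    # Round-robin: take one example per category until the budget
    # is spent or every category pool is exhausted.
    while len(sampled) < budget and any(pools):
        for pool in pools:
            if pool and len(sampled) < budget:
                sampled.append(pool.pop())
    return sampled
```

With a budget of one example per category, such a sampler guarantees category coverage, whereas uniform random sampling may concentrate the tiny 0.05% budget in a few over-represented harm categories.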