Keywords: Large Language Models, Safety Alignment, Preference Alignment
Abstract: Domain-specific fine-tuning of large language models (LLMs) often compromises their safety alignment, leading to unsafe generations. Existing approaches largely rely on distributional alignment, enforcing token-level similarity between the pre- and post-fine-tuned models. However, this neglects the semantic nature of text generation and can weaken the model's reasoning and robustness. To address this limitation, we propose a preference-based alignment framework that complements distributional alignment by biasing the fine-tuned model toward the safe outputs of the pre-trained model rather than strictly preserving distributional similarity. Simulation results show that preference alignment produces consistently safe outputs even when the underlying distributions differ. Extensive experiments on multiple fine-tuning attack datasets and utility benchmarks further demonstrate that our method substantially improves safety with only minor degradation in utility, achieving a more favorable safety-utility trade-off and significantly enhancing robustness against adversarial fine-tuning.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15036
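For intuition, the sketch below shows one common way to instantiate preference-based alignment: a DPO-style logistic loss that rewards the fine-tuned model for preferring the pre-trained (reference) model's safe response over an unsafe one, instead of enforcing token-level distributional similarity. This is a minimal, generic illustration and not the submission's actual objective; the function name preference_alignment_loss, the tensor shapes, and the hyperparameter beta are assumptions made for exposition.

import torch
import torch.nn.functional as F

def preference_alignment_loss(
    policy_logp_safe: torch.Tensor,    # log p_theta(y_safe | x) under the fine-tuned model, shape [batch]
    policy_logp_unsafe: torch.Tensor,  # log p_theta(y_unsafe | x) under the fine-tuned model, shape [batch]
    ref_logp_safe: torch.Tensor,       # log p_ref(y_safe | x) under the pre-trained reference model
    ref_logp_unsafe: torch.Tensor,     # log p_ref(y_unsafe | x) under the pre-trained reference model
    beta: float = 0.1,                 # preference temperature (assumed hyperparameter)
) -> torch.Tensor:
    """DPO-style loss that biases the fine-tuned model toward the safe
    response preferred by the reference model, without requiring the two
    output distributions to match token by token."""
    # Implicit reward margins of the policy relative to the reference model.
    safe_margin = policy_logp_safe - ref_logp_safe
    unsafe_margin = policy_logp_unsafe - ref_logp_unsafe
    # Bradley-Terry / logistic loss on the gap between safe and unsafe margins.
    return -F.logsigmoid(beta * (safe_margin - unsafe_margin)).mean()

Because the loss depends only on the relative log-likelihoods of whole responses, it can keep the safe response preferred even when the fine-tuned and pre-trained distributions diverge at the token level.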