Keywords: Large Language Models, Safety Alignment, Preference Alignment
Abstract: Domain-specific fine-tuning of large language models (LLMs) often compromises their safety alignment, leading to unsafe generations. Existing approaches largely rely on distributional alignment, enforcing token-level similarity between the pre- and post-fine-tuned models. However, this neglects the semantic nature of text generation and can weaken the model's reasoning and robustness. To address this limitation, we propose a preference-based alignment framework that complements distributional alignment by biasing the fine-tuned model toward the safe outputs of the pre-trained model rather than strictly preserving distributional similarity. Simulation results show that preference alignment produces consistently safe outputs even when the underlying distributions differ. Extensive experiments on multiple fine-tuning attack datasets and utility benchmarks further demonstrate that our method substantially improves safety with only minor degradation in utility, achieving a more favorable safety-utility trade-off and significantly enhancing robustness against adversarial fine-tuning.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15036
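For intuition, the sketch below shows one common way to instantiate preference-based alignment: a DPO-style logistic loss that rewards the fine-tuned model for preferring the pre-trained (reference) model's safe response over an unsafe one, instead of enforcing token-level distributional similarity. This is a minimal, generic illustration and not the submission's actual objective; the function name preference_alignment_loss, the tensor shapes, and the hyperparameter beta are assumptions made for exposition.

import torch
import torch.nn.functional as F

def preference_alignment_loss(
    policy_logp_safe: torch.Tensor,    # log p_theta(y_safe | x) under the fine-tuned model, shape [batch]
    policy_logp_unsafe: torch.Tensor,  # log p_theta(y_unsafe | x) under the fine-tuned model, shape [batch]
    ref_logp_safe: torch.Tensor,       # log p_ref(y_safe | x) under the pre-trained reference model
    ref_logp_unsafe: torch.Tensor,     # log p_ref(y_unsafe | x) under the pre-trained reference model
    beta: float = 0.1,                 # preference temperature (assumed hyperparameter)
) -> torch.Tensor:
    """DPO-style loss that biases the fine-tuned model toward the safe
    response preferred by the reference model, without requiring the two
    output distributions to match token by token."""
    # Implicit reward margins of the policy relative to the reference model.
    safe_margin = policy_logp_safe - ref_logp_safe
    unsafe_margin = policy_logp_unsafe - ref_logp_unsafe
    # Bradley-Terry / logistic loss on the gap between safe and unsafe margins.
    return -F.logsigmoid(beta * (safe_margin - unsafe_margin)).mean()

Because the loss depends only on the relative log-likelihoods of whole responses, it can keep the safe response preferred even when the fine-tuned and pre-trained distributions diverge at the token level.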