Safety-Anchored Fine-Tuning: Diagnosing and Preventing Safety Collapse in Large Language Models via Adversarial Alignment Anchoring

Published: 03 Jun 2026, Last Modified: 03 Jun 2026 | AI4GOOD Workshop 2026 Regular | Everyone | CC BY 4.0
Keywords: Safety Alignment, Fine-Tuning Collapse, Large Language Models, Adversarial Training, KL Divergence, Representation Anchoring, Centered Kernel Alignment, PGD, LoRA, Mechanistic Interpretability
TL;DR: Benign fine-tuning can quietly break LLM safety alignment, and current defenses often worsen the issue. We diagnose why this happens and propose SAFT + RepAnchor, which maintain near-baseline safety (3–5% ASR) while improving downstream performance.
Abstract: Fine-tuning a safety-aligned language model on a completely benign dataset should not destroy its alignment, yet in practice it consistently does. Worse, the methods designed to prevent this degradation can amplify it. Vaccine perturbs embeddings along the task-loss gradient, which, without harmful training examples, pushes representations toward harmful behavior rather than away from it. SAP relies on a learned safety probe trained on harmful examples; on a benign distribution the probe never activates and its perturbations become directionally arbitrary. In our experiments, Vaccine reaches 99% attack success rate (ASR) in medical fine-tuning and 94% in finance, compared to 14% and 81% for undefended SFT. SAP reaches 21.5% and 73.5%, EWC 9.5% and 78.5%, and RepNoise 10.0% and 88.5%. We study the failure mechanistically with Centered Kernel Alignment (CKA): collapse concentrates in posterior layers (28–32) for code and finance and in upper-middle layers (18–22) for medical, and it is front-loaded within 50–250 steps, while mean CKA remains above 0.988. This points to an output-level rather than representational failure. We propose Safety-Anchored Fine-Tuning (SAFT), which uses a PGD inner loop whose adversarial objective is the KL divergence to a frozen aligned reference, and RepAnchor, a CKA-drift-weighted MSE penalty on the most vulnerable layers. SAFT+RepAnchor achieves ASRs of 2.5%, 2.0%, and 4.0% across code, finance, and medical, at or below the 4.5% pre-fine-tuning baseline, while improving downstream utility (HumanEval pass@1: 0.780 vs. 0.720 for SFT). Safety and utility are continuously tunable through λ and γ.
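The abstract uses CKA to compare layer representations before and after fine-tuning. As a rough illustration of what such a score measures, the sketch below computes the standard linear CKA between two activation matrices. It is not the authors' implementation: the abstract does not say which CKA variant, probe inputs, or layer-extraction procedure the paper uses, so the linear kernel, the function name `linear_cka`, and the random stand-in activations are all assumptions made for this example.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Assumption: the linear-kernel form of CKA; the paper may use another variant.
    """
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of each representation's own covariance.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-ins for one layer's activations on a fixed probe set.
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 16))                      # "before fine-tuning"
drifted = base + 0.05 * rng.normal(size=base.shape)   # "after fine-tuning", small drift

same_score = linear_cka(base, base)      # identical representations
drift_score = linear_cka(base, drifted)  # slightly perturbed representations
scaled_score = linear_cka(base, 3.0 * base)  # CKA is invariant to isotropic scaling
```

A score near 1 means the two representations are nearly identical up to rotation and scaling, which is the sense in which a mean CKA above 0.988 indicates that the internal representations barely move even when safety behavior degrades.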
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 391