Keywords: Safety alignment, information bottleneck, contrastive learning, fine-tuning, continual learning
Abstract: Fine-tuning large language models on downstream tasks often degrades their safety alignment, a problem that compounds during sequential adaptation. We introduce Conditional Information Bottleneck (CIB), which preserves safety by encouraging fine-tuned representations to remain close to those of an aligned reference model. Our insight is that aligned models already encode safety-relevant structure, which can serve as implicit supervision without requiring safety labels. Information-theoretic analysis shows that maximizing mutual information between fine-tuned and reference representations preserves this structure, while the alignment tax (the performance cost of safety constraints) remains small for benign tasks whose labels are largely independent of safety structure. Experiments across multiple model families demonstrate substantial safety improvements with minimal performance degradation, and strong correlations between our theoretical quantities and harm rates validate our analysis.
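To make the abstract's core idea concrete, here is a minimal sketch of one plausible instantiation: an InfoNCE-style contrastive objective that encourages fine-tuned hidden states to stay close to those of a frozen aligned reference model, as a tractable lower bound on the mutual information the abstract describes. All names (`ft_hidden`, `ref_hidden`, `temperature`, the lambda weighting) are illustrative assumptions, not the paper's actual API or exact loss.

```python
import torch
import torch.nn.functional as F

def cib_contrastive_loss(ft_hidden: torch.Tensor,
                         ref_hidden: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style lower bound on I(fine-tuned; reference):
    each fine-tuned representation should match its own reference
    representation against the other examples in the batch."""
    ft = F.normalize(ft_hidden, dim=-1)    # (batch, d) fine-tuned representations
    ref = F.normalize(ref_hidden, dim=-1)  # (batch, d) frozen reference representations
    logits = ft @ ref.t() / temperature    # similarity of every ft row to every ref row
    targets = torch.arange(ft.size(0), device=ft.device)  # positive pair = same index
    return F.cross_entropy(logits, targets)

# Illustrative usage: total_loss = task_loss + lam * cib_contrastive_loss(ft_h, ref_h),
# where ref_h is computed from the frozen aligned reference model under torch.no_grad().
```

This is only a sketch under the stated assumptions; the paper's actual objective and conditioning scheme may differ.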
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: fine-tuning; continual learning; safety and alignment
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: English
Submission Number: 8097