Keywords: Safety alignment, information bottleneck, contrastive learning, fine-tuning, continual learning
Abstract: Fine-tuning large language models on downstream tasks often degrades their safety alignment, a problem that compounds during sequential adaptation. We introduce Conditional Information Bottleneck (CIB), which preserves safety by encouraging fine-tuned representations to remain close to those of an aligned reference model. Our insight is that aligned models already encode safety-relevant structure, which can serve as implicit supervision without requiring safety labels. Information-theoretic analysis shows that maximizing mutual information between fine-tuned and reference representations preserves this structure, while the alignment tax (the performance cost of safety constraints) remains small for benign tasks whose labels are largely independent of safety structure. Experiments across multiple model families demonstrate substantial safety improvements with minimal performance degradation, and strong correlations between our theoretical quantities and harm rates validate our analysis.
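To make the abstract's core idea concrete, here is a minimal sketch of one plausible instantiation: an InfoNCE-style contrastive objective that encourages fine-tuned hidden states to stay close to those of a frozen aligned reference model, as a tractable lower bound on the mutual information the abstract describes. All names (`ft_hidden`, `ref_hidden`, `temperature`, the lambda weighting) are illustrative assumptions, not the paper's actual API or exact loss.

```python
import torch
import torch.nn.functional as F

def cib_contrastive_loss(ft_hidden: torch.Tensor,
                         ref_hidden: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style lower bound on I(fine-tuned; reference):
    each fine-tuned representation should match its own reference
    representation against the other examples in the batch."""
    ft = F.normalize(ft_hidden, dim=-1)    # (batch, d) fine-tuned representations
    ref = F.normalize(ref_hidden, dim=-1)  # (batch, d) frozen reference representations
    logits = ft @ ref.t() / temperature    # similarity of every ft row to every ref row
    targets = torch.arange(ft.size(0), device=ft.device)  # positive pair = same index
    return F.cross_entropy(logits, targets)

# Illustrative usage: total_loss = task_loss + lam * cib_contrastive_loss(ft_h, ref_h),
# where ref_h is computed from the frozen aligned reference model under torch.no_grad().
```

This is only a sketch under the stated assumptions; the paper's actual objective and conditioning scheme may differ.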
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: fine-tuning; continual learning; safety and alignment
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: English
Submission Number: 8097