Track: Technical
Keywords: Large Language Models, Alignment, Elasticity of LLMs, Safety
TL;DR: We demonstrate the *elasticity* of post-alignment models, which underlies their resistance to alignment.
Abstract: Large language models (LLMs) may exhibit undesirable behaviors. Recent efforts have focused on aligning these models to prevent harmful generation. Despite these efforts, studies have shown that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have a robust effect on models, or is it merely *superficial*? In this work, we answer this question through both theoretical and empirical means. Empirically, we demonstrate the *elasticity* of post-alignment models, *i.e.*, their tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Using compression theory, we formally derive that such a fine-tuning process *disproportionately* undermines alignment compared to pre-training, potentially by orders of magnitude.
We conduct experimental validation to confirm the presence of *elasticity* across models of varying types and sizes. Specifically, we find that model performance declines rapidly before the model reverts to the pre-training distribution, after which the rate of decline drops significantly (a minimal probing sketch is given below). We further reveal that *elasticity* correlates positively with model size and with the amount of pre-training data.
Our findings underscore the importance of taming the inherent elasticity of LLMs, thereby overcoming their resistance to alignment fine-tuning.
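The empirical procedure summarized above (fine-tune an aligned model and watch its behavior drift back toward the pre-training distribution) can be probed with a small script. The following is a minimal sketch, not the authors' code: the checkpoint name and the three text corpora (`alignment_texts`, `pretrain_texts`, `unrelated_texts`) are placeholder assumptions, and elasticity is proxied here by perplexity shifts rather than the paper's compression-theoretic quantities.

```python
# Sketch: measure how quickly benign fine-tuning erodes alignment behavior.
# "aligned-llm" and the corpora below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "aligned-llm"  # hypothetical aligned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def perplexity(texts):
    """Average perplexity of the current model over a list of strings."""
    model.eval()
    losses = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt", truncation=True).input_ids
            losses.append(model(ids, labels=ids).loss.item())
    return float(torch.exp(torch.tensor(losses).mean()))

def finetune(texts, epochs=1, lr=1e-5):
    """A few gradient steps on a small, possibly unrelated corpus."""
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for t in texts:
            ids = tok(t, return_tensors="pt", truncation=True).input_ids
            loss = model(ids, labels=ids).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

# Placeholder corpora (assumptions, not provided by the paper):
# alignment_texts: responses reflecting aligned behavior
# pretrain_texts:  text drawn from a pre-training-like distribution
# unrelated_texts: a benign fine-tuning set
#
# before = perplexity(alignment_texts), perplexity(pretrain_texts)
# finetune(unrelated_texts)
# after  = perplexity(alignment_texts), perplexity(pretrain_texts)
```

Under the elasticity hypothesis, the perplexity on alignment-style text should degrade much faster than the perplexity on pre-training-style text changes, i.e., fine-tuning disproportionately unwinds the alignment stage while the pre-training distribution is largely recovered.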
Submission Number: 108