Track: Technical
Keywords: Large Language Models, Alignment, Elasticity of LLMs, Safety
TL;DR: We demonstrate the *elasticity* of post-alignment models, which underlies their resistance to alignment.
Abstract: Large language models (LLMs) may exhibit undesirable behaviors. Recent efforts have focused on aligning these models to prevent harmful generation. Despite these efforts, studies have shown that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have a robust effect on models, or is it merely *superficial*? In this work, we answer this question through both theoretical and empirical means. Empirically, we demonstrate the *elasticity* of post-alignment models, *i.e.*, their tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Using compression theory, we formally derive that such a fine-tuning process *disproportionately* undermines alignment compared to pre-training, potentially by orders of magnitude.
We conduct experimental validation to confirm the presence of *elasticity* across models of varying types and sizes. Specifically, we find that model performance declines rapidly before the model reverts to the pre-training distribution, after which the rate of decline drops significantly (a minimal probing sketch is given below). We further reveal that *elasticity* correlates positively with model size and with the amount of pre-training data.
Our findings underscore the importance of taming the inherent elasticity of LLMs, thereby overcoming their resistance to alignment fine-tuning.
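The empirical procedure summarized above (fine-tune an aligned model and watch its behavior drift back toward the pre-training distribution) can be probed with a small script. The following is a minimal sketch, not the authors' code: the checkpoint name and the three text corpora (`alignment_texts`, `pretrain_texts`, `unrelated_texts`) are placeholder assumptions, and elasticity is proxied here by perplexity shifts rather than the paper's compression-theoretic quantities.

```python
# Sketch: measure how quickly benign fine-tuning erodes alignment behavior.
# "aligned-llm" and the corpora below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "aligned-llm"  # hypothetical aligned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def perplexity(texts):
    """Average perplexity of the current model over a list of strings."""
    model.eval()
    losses = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt", truncation=True).input_ids
            losses.append(model(ids, labels=ids).loss.item())
    return float(torch.exp(torch.tensor(losses).mean()))

def finetune(texts, epochs=1, lr=1e-5):
    """A few gradient steps on a small, possibly unrelated corpus."""
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for t in texts:
            ids = tok(t, return_tensors="pt", truncation=True).input_ids
            loss = model(ids, labels=ids).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

# Placeholder corpora (assumptions, not provided by the paper):
# alignment_texts: responses reflecting aligned behavior
# pretrain_texts:  text drawn from a pre-training-like distribution
# unrelated_texts: a benign fine-tuning set
#
# before = perplexity(alignment_texts), perplexity(pretrain_texts)
# finetune(unrelated_texts)
# after  = perplexity(alignment_texts), perplexity(pretrain_texts)
```

Under the elasticity hypothesis, the perplexity on alignment-style text should degrade much faster than the perplexity on pre-training-style text changes, i.e., fine-tuning disproportionately unwinds the alignment stage while the pre-training distribution is largely recovered.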
Submission Number: 108