Representational de-collapse: Interactions between supervised finetuning and in-context learning in language models
Keywords: In-context learning, supervised finetuning, safety
Abstract: Supervised fine-tuning (SFT) is a widely used method for safety alignment of large language models (LLMs), ensuring that models learn to refuse potentially harmful queries or to correct unethical biases. Yet SFT has been shown to be brittle and easily circumvented. One particular susceptibility comes from in-context learning (ICL): models learn from in-context demonstrations to overwrite previously acquired safety guardrails. In this work, we introduce a simple task that enables us to precisely control and study the progressive overwriting of knowledge acquired from SFT via ICL. We demonstrate that our task elicits ICL overwriting of SFT and analyze its dynamics across the context. Next, we show that the overwriting behavior scales with model size. Finally, we analyze hidden representations during the overwriting of SFT by ICL. In line with previous work, we find that SFT collapses representations along task labels. However, we show that ICL can reverse this collapse and recover rich representations akin to the pre-SFT model state. Overall, our work proposes a controlled setup for investigating the interaction between ICL and SFT in LLMs, especially in the context of safety.
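The representational collapse described in the abstract can be quantified in several ways; the paper's exact metric is not stated here. Below is a minimal sketch, assuming a within-label versus total variance ratio over hidden states, where all names and the toy data are hypothetical:

```python
# Illustrative sketch of one way to quantify representational collapse along
# task labels. Scores near 0 indicate collapsed (label-clustered) hidden
# states; scores near 1 indicate rich, label-independent structure.
import numpy as np

def collapse_score(hidden: np.ndarray, labels: np.ndarray) -> float:
    """Ratio of within-label variance to total variance, in [0, 1].

    hidden: (n_examples, d_model) hidden states from some layer.
    labels: (n_examples,) task labels.
    """
    total_var = hidden.var(axis=0).sum()
    within_var = 0.0
    for y in np.unique(labels):
        group = hidden[labels == y]
        # Weight each label's variance by its share of the examples.
        within_var += group.var(axis=0).sum() * (len(group) / len(hidden))
    return float(within_var / total_var)

# Toy usage: compare rich vs. label-collapsed representations.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
rich = rng.normal(size=(200, 64))               # label-independent states
collapsed = rich * 0.1 + labels[:, None] * 3.0  # states clustered by label
print(collapse_score(rich, labels))       # close to 1
print(collapse_score(collapsed, labels))  # close to 0
```

Tracking such a score over SFT checkpoints and then across in-context demonstrations would show collapse followed by the de-collapse the abstract reports.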
Submission Number: 43