Open Source Links: https://github.com/science-of-finetuning/diffing-toolkit
Keywords: Steering, Automated interpretability, Benchmarking interpretability
Other Keywords: Model Diffing
TL;DR: Activation differences between base and narrowly finetuned models reveal the finetune’s objective, even on unrelated data, letting an interpretability agent reliably detect finetuning objectives.
Abstract: Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research.
In this paper, we show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing---the study of differences between models before and after finetuning.
In particular, analyzing activation differences on the first few tokens of random text, and steering by adding these differences to the model's activations, produces text similar in format and general content to the finetuning data.
We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain.
Given access to the bias insights, the agent performs more than twice as well at identifying the broad finetuning objective and over 30 times better at identifying specific details than baseline agents that use simple prompting.
Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect that these biases are a form of overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, although we cannot rule out further issues.
Our work (1) demonstrates that narrowly finetuned models carry salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning, such as chat-tuning, might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and the development of truly realistic case studies for model diffing, safety, and interpretability research.
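The sketch below illustrates the kind of activation-difference steering the abstract describes: collect residual-stream activations from the base and finetuned models on the first few tokens of unrelated text, take the mean difference at one layer, and add it back during generation. It is a minimal illustration, not the paper's exact pipeline; the model names, layer index, steering scale, and prompt set are assumptions.

```python
# Minimal sketch of activation-difference steering (illustrative, not the paper's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "google/gemma-2-2b-it"          # assumption: any base/finetune pair of the same architecture
FINETUNED = "path/to/narrow-finetune"  # hypothetical path to the narrowly finetuned checkpoint
LAYER, N_TOKENS, SCALE = 12, 5, 4.0    # illustrative layer, token count, and steering strength

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, output_hidden_states=True)
ft = AutoModelForCausalLM.from_pretrained(FINETUNED, torch_dtype=torch.bfloat16, output_hidden_states=True)

def mean_first_token_acts(model, texts):
    """Mean hidden state at LAYER over the first N_TOKENS of each text, averaged over texts."""
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids[:, :N_TOKENS]
        with torch.no_grad():
            hs = model(ids).hidden_states[LAYER]   # (1, seq, d_model)
        acts.append(hs.mean(dim=1))                # average over token positions
    return torch.cat(acts).mean(dim=0)             # average over texts -> (d_model,)

# Unrelated "random" text: the bias shows up even off-distribution.
random_texts = ["The weather in", "Quarterly results show", "Once upon a time"]
steer_vec = mean_first_token_acts(ft, random_texts) - mean_first_token_acts(base, random_texts)

# Steer the base model by adding the difference to LAYER's output via a forward hook.
def hook(_module, _inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer_vec.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = base.model.layers[LAYER].register_forward_hook(hook)
prompt = tok("Tell me something.", return_tensors="pt")
print(tok.decode(base.generate(**prompt, max_new_tokens=40)[0]))
handle.remove()
```

Under the abstract's claim, the steered generations should echo the format and general content of the finetuning corpus even though the prompts are unrelated to it.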
Submission Number: 192