Narrow Finetuning Leaves Clearly Readable Traces in the Activation Differences

Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Venue: Mech Interp Workshop (NeurIPS 2025) Spotlight
License: CC BY 4.0
Keywords: Steering, Automated interpretability, Benchmarking interpretability
Other Keywords: Model Diffing
TL;DR: Activation differences between base and narrowly finetuned models reveal the finetune’s objective, even on unrelated data, letting an interpretability agent reliably detect finetuning objectives.
Abstract: Model diffing is a promising approach for understanding how finetuning modifies neural networks by studying the difference between the base and finetuned model. A natural starting point for developing such techniques is to study narrowly finetuned model organisms: targeted finetunes whose specific inserted behaviors provide a known ground truth for evaluation. Recent work has produced numerous such model organisms, providing an opportunity for this kind of evaluation. In this work, we show that we can often read out the training objective by analyzing activation differences between base and finetuned models on the first few tokens of random text data. Moreover, steering with this difference allows us to recover the format and general content of the training data. Overall, we find this simple and cheap approach highly informative across multiple model organisms. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). Even for behaviors that appear non-obvious upon initial inspection, the activation differences reliably reveal information about the finetuning domain. Using an interpretability agent, we demonstrate that these activation differences enable highly accurate identification of finetuning domains, significantly outperforming black-box agents. We hypothesize that these effects stem from overfitting during narrow finetuning, and show that training with less data or mixing in other data may reduce these detectable artifacts. These findings raise important questions about the validity of studying model organisms that exhibit such readily detectable biases, as they may not adequately represent more naturally acquired behaviors. This calls for iterative refinement to develop realistic model organisms suitable for model-diffing study.
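To make the described procedure concrete, the sketch below shows one way to compute an activation-difference vector between a base model and a narrow finetune on the first few tokens of unrelated prompts, and then steer the base model along that direction. This is a minimal sketch, not the authors' implementation: the model identifiers, layer index, token count, prompts, and steering scale are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): read out an activation-difference
# direction between a base model and a narrow finetune on the first few tokens
# of unrelated prompts, then steer the base model along that direction.
# Model names, LAYER, N_TOKENS, and SCALE are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "google/gemma-2-2b-it"        # assumed base model
TUNED_ID = "path/to/narrow-finetune"    # placeholder finetuned checkpoint
LAYER, N_TOKENS, SCALE = 12, 5, 4.0     # assumed layer, token count, steering strength

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID)

def mean_hidden(model, texts):
    """Average residual-stream activation at LAYER over the first N_TOKENS of each text."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids[:, :N_TOKENS]
        with torch.no_grad():
            # hidden_states[0] is the embedding output, so LAYER + 1 is the output of decoder layer LAYER
            hs = model(ids, output_hidden_states=True).hidden_states[LAYER + 1]
        vecs.append(hs[0].float().mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# Prompts deliberately unrelated to the finetuning domain ("random text data").
prompts = ["The weather in Paris is", "def quicksort(arr):", "Once upon a time"]
diff = mean_hidden(tuned, prompts) - mean_hidden(base, prompts)

# Steer: add the difference vector to the base model's residual stream at LAYER.
def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * diff.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = base.model.layers[LAYER].register_forward_hook(steering_hook)
out = base.generate(**tok("Tell me something interesting.", return_tensors="pt"),
                    max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))  # steered text tends to echo the finetuning format/content
handle.remove()
```

In this setup, inspecting the steered generations (or the tokens whose unembeddings align most with `diff`) is what would reveal the finetuning domain; the specific layer and scale typically need light tuning per model.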
Submission Number: 192