Keywords: Causal interventions, Understanding high-level properties of models, Developmental interpretability
Other Keywords: model diffing
TL;DR: TopKLoRA is a fine-tuning adapter that is interpretable by design, with features that are functionally causal for the downstream task
Abstract: Model diffing identifies the representational differences between a base model and a fine-tuned model. However, current methods based on sparse dictionary learning are trained post hoc on a reconstruction loss, which yields features that often fail to be functionally causal for model behaviour. In this work, we introduce TopKLoRA -- a LoRA-like adapter that retains LoRA's adapter-style deployment and low-rank updates while exposing an input-conditioned, discrete selection of feature directions that, unlike reconstruction-trained features, provide controllable levers for model behaviour.
Unlike standard LoRA, we do not train a low-rank dense adapter; instead, we train a high-rank sparse adapter by applying TopK sparsity in the adapter space, incentivising interpretability while retaining the conceptual idea of LoRA.
Each active component in the adapter space corresponds to a rank-1 "feature direction", and the per-example update has a low effective rank of at most $k$, with $k \ll d_{\text{model}}$. In our experiments, we train adapters across three adapter-dimension and $k$ combinations for instruction-following supervised fine-tuning (SFT) and safety direct preference optimisation (DPO) of the Gemma 2 2B model. We demonstrate that downstream task performance on the RealToxicityPrompts benchmark is maintained relative to a dense LoRA, as measured by the Perspective API score. Moreover, we identify interpretable and causal features in the sparse space through autointerp and ablation studies along each rank-1 feature direction. This method provides interpretable model-diffing information "for free" without degrading downstream task performance. More broadly, this work demonstrates the effectiveness of incorporating intrinsically interpretable model segments trained on the downstream loss.
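To make the construction concrete, below is a minimal PyTorch sketch of one possible TopK-sparse, LoRA-style adapter layer, assuming a frozen `nn.Linear` base layer. The class name `TopKLoRALinear`, the argument `d_adapter`, and the zero-initialisation of `B` are illustrative assumptions based on the abstract's description, not details taken from the paper.

```python
import torch
import torch.nn as nn

class TopKLoRALinear(nn.Module):
    """Sketch of a TopK-sparse LoRA-style adapter wrapped around a frozen linear layer.

    The adapter space is high-dimensional (d_adapter >> a typical LoRA rank), and a
    per-example TopK keeps only the k largest adapter activations, so each update is a
    sum of at most k rank-1 "feature directions" (columns of B).
    """

    def __init__(self, base_linear: nn.Linear, d_adapter: int, k: int):
        super().__init__()
        self.base = base_linear                  # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.k = k
        # "Encoder" into the high-rank adapter space and "decoder" back out.
        self.A = nn.Parameter(torch.randn(d_adapter, d_in) * d_in ** -0.5)
        self.B = nn.Parameter(torch.zeros(d_out, d_adapter))  # zero-init so the adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = x @ self.A.T                              # adapter activations, shape (..., d_adapter)
        topk = torch.topk(z, self.k, dim=-1)          # keep the k largest activations per example
        z_sparse = torch.zeros_like(z).scatter(-1, topk.indices, topk.values)
        return self.base(x) + z_sparse @ self.B.T    # frozen output + sparse, low-effective-rank update
```

Because only $k$ entries of the adapter activation survive the TopK, the per-example update is a sum of at most $k$ columns of `B`, matching the "effective rank of at most $k$" description above.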
Submission Number: 318