Open Source Links: https://github.com/marek357/lora_interp
Keywords: Causal interventions, Understanding high-level properties of models, Developmental interpretability
Other Keywords: model diffing
TL;DR: TopKLoRA is a fine-tuning adapter that is interpretable by design, with features that are functionally causal for the downstream task.
Abstract: Model diffing finds the representational differences between a base and a fine-tuned model. Leading approaches use sparse-dictionary learning. However, these methods are trained post hoc on a reconstruction loss, which results in features that often fail to be functionally causal for model behaviour. In this work, we introduce TopKLoRA -- a LoRA-like adapter that retains LoRA's adapter-style deployment and low-rank updates while exposing an input-conditioned, discrete selection of feature directions that, unlike reconstruction-trained features, provide controllable levers over model behaviour.
Unlike standard LoRA, we do not train a dense low-rank adapter; instead, we train a high-rank sparse adapter by applying TopK sparsity in the adapter space, encouraging interpretability while retaining LoRA's conceptual design.
Each active component in the adapter space corresponds to a rank-1 "feature direction", and the per-example update has a low effective rank of at most $k$, with $k \ll d_{\text{model}}$. In our experiments, we train adapters across four adapter-dimension and $k$ combinations on a harmfulness-reduction task, using direct preference optimisation (DPO) of a Gemma 2 2B base model that was supervised fine-tuned for instruction following. We demonstrate that downstream task performance is maintained relative to a dense LoRA on the RealToxicityPrompts benchmark, as measured by the Perspective API score. Moreover, we identify interpretable and causal features in the sparse space through an autointerp study along each rank-1 feature direction. This method provides interpretable model-diffing information "for free" without degrading downstream task performance. More broadly, this work demonstrates the effectiveness of incorporating intrinsically interpretable model segments trained on the downstream loss. We publish the code at: https://github.com/marek357/lora_interp
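The abstract describes the core mechanism: a LoRA-style adapter whose intermediate ("adapter-space") activations are made sparse with a TopK operation, so that each retained coordinate contributes a rank-1 feature direction and the per-example update has effective rank at most $k$. Below is a minimal sketch of this idea in PyTorch, assuming the usual LoRA setup of a frozen base linear layer plus trainable down/up projections; the class and parameter names (TopKLoRALinear, adapter_dim, k, alpha) are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn as nn


class TopKLoRALinear(nn.Module):
    """Sketch: a frozen base linear layer plus a high-rank, TopK-sparse adapter.

    The adapter maps inputs into an adapter_dim-dimensional space, keeps only
    the k largest activations per example (TopK sparsity), and maps back to
    the output space. Each surviving coordinate contributes one rank-1
    "feature direction", so the per-example update has rank at most k.
    """

    def __init__(self, base: nn.Linear, adapter_dim: int, k: int, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen, as in LoRA
        self.k = k
        self.scale = alpha
        # Down-projection into the (large) adapter space and up-projection back.
        self.down = nn.Linear(base.in_features, adapter_dim, bias=False)
        self.up = nn.Linear(adapter_dim, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.down(x)  # adapter-space pre-activations ("feature" codes)
        # TopK sparsity: keep the k largest activations per example, zero the rest.
        topk_vals, topk_idx = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter(-1, topk_idx, topk_vals)
        return self.base(x) + self.scale * self.up(z_sparse)
```

Under these assumptions, the i-th column of the up-projection (`self.up.weight[:, i]`) is the rank-1 output direction associated with feature i, which is the kind of per-feature direction an autointerp study of the sparse space would inspect.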
Submission Number: 318