Steering as Implicit Low-Rank Finetuning: A Theoretical Framework for Activation Steering in Transformers
Keywords: steering, fine-tuning, LoRA, activation space, mechanistic interpretability
TL;DR: Rank-one LoRA on value/output matrices acts as implicit activation steering by algebraic necessity, while query/key updates produce context-dependent geometry that only resembles steering when the data distribution cooperates.
Abstract: Activation steering and supervised fine-tuning are common ways to change language model behavior, yet their mechanistic connection remains unclear. We show that they become structurally linked when fine-tuning is performed through low-rank updates, as in LoRA. In a linear attention layer, any rank-one update to the value or output-projection matrix forces all residual-stream shifts to lie along a single fixed direction. We also show that the minimum-norm rank-one update matching a target mean displacement aligns its singular direction with the steering vector, explaining why learned shift directions match activation steering most strongly at small width. In contrast, query- and key-matrix updates act through context-dependent directions; they resemble fixed-direction steering only when those directions are nearly collinear across the data. Empirically, updates to the attention query and key matrices produce more variable and less consistently rank-one activation shifts than updates to the value and projection matrices, and they require larger weight changes to achieve comparable behavioral effects. When all first-layer attention matrices are trainable, the learned residual shift is again nearly rank one, suggesting that optimization tends to use, or align with, the fixed-write mechanism when it is available.
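The central algebraic claim — that a rank-one update to the value path writes to the residual stream along one fixed direction — can be checked in a few lines. The sketch below is our illustration under simplifying assumptions (a single-head linear attention layer with hidden size `d`; all matrix names are ours, not the paper's code):

```python
import numpy as np

# In a linear attention layer, the value-path write to the residual stream is
#   out = A @ X @ W_V @ W_O
# with attention weights A and token states X. A rank-one update
#   Delta = u @ v.T  applied to W_V changes the write by
#   A @ X @ (u @ v.T) @ W_O = (A @ X @ u) @ (v.T @ W_O),
# so every token's shift is a scalar multiple of the single fixed row
# direction v.T @ W_O, independent of the input.

rng = np.random.default_rng(0)
d, n = 16, 5                                  # hidden size, sequence length
X = rng.normal(size=(n, d))                   # token representations
A = rng.random(size=(n, n))
A /= A.sum(axis=1, keepdims=True)             # row-stochastic attention weights
W_V = rng.normal(size=(d, d))
W_O = rng.normal(size=(d, d))

u = rng.normal(size=(d, 1))
v = rng.normal(size=(d, 1))
Delta = u @ v.T                               # rank-one LoRA-style update to W_V

shift = A @ X @ Delta @ W_O                   # change in the residual-stream write
print(np.linalg.matrix_rank(shift))           # 1: all rows share one direction
```

Running the same check with `Delta` applied to the query matrix instead would not factor this way: the update then enters through the attention weights, so the induced shift directions depend on the context, which is the contrast the abstract draws.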
Submission Number: 132