Does Downstream Fine-Tuning Undo Embedded Activation Steering?

Published: 02 Mar 2026, Last Modified: 06 Mar 2026
Venue: ICLR 2026 Re-Align Workshop
License: CC BY 4.0
Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: Activation steering can modify a language model's behaviour by intervening on its internal representations along a feature direction, and linear steering methods can be embedded directly into the model's weights. However, it is unclear whether such embedded interventions persist when the model undergoes further training. We investigate the stability of embedded steering under routine, non-adversarial fine-tuning across five instruction-tuned models (3B--14B parameters), two training paradigms (SFT and RLHF), and two steering targets: refusal suppression and brevity induction. For the latter, we introduce activation amplification, a linear operator that strengthens an existing feature direction and can be embedded in model weights. We find that behavioural preservation varies with the optimisation pressure exerted by training content: steering persists when training data does not contradict the steered behaviour, and degrades when it does. Mechanistically, however, the steering modification itself remains nearly intact in weight space ($\rho < 0.02$) across all conditions, even where behaviour substantially reverts. This dissociation suggests that fine-tuning does not reverse the weight edit, but rather develops alternate pathways that reduce its downstream effect. Embedded steering thus appears durable but not unconditionally robust, and behavioural re-validation after downstream training remains necessary.
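The abstract describes activation amplification as a linear operator that strengthens an existing feature direction and can be folded into model weights. A minimal sketch of one way such an operator could work, assuming the form $A = I + \alpha\, \hat{v}\hat{v}^\top$ applied to a layer's output projection (the paper's exact formulation may differ; all names and dimensions below are illustrative):

```python
import numpy as np

def embed_amplification(W_out, v, alpha):
    """Fold a feature-amplification operator into an output weight matrix.

    A = I + alpha * (v v^T / ||v||^2) scales the component of any activation
    along v by (1 + alpha) and leaves the orthogonal complement untouched.
    Composing A with W_out yields a single edited matrix, so the steering
    needs no runtime hook -- it lives in the weights.
    """
    v = v / np.linalg.norm(v)                      # unit feature direction
    A = np.eye(len(v)) + alpha * np.outer(v, v)    # amplification operator
    return A @ W_out

# Hypothetical dimensions, chosen for illustration only.
d = 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))        # stand-in for an output projection
v = rng.normal(size=d)             # stand-in for a learned feature direction

W_edited = embed_amplification(W, v, alpha=0.5)

x = rng.normal(size=d)
h = W @ x                          # original layer output
h_edited = W_edited @ x            # output with embedded amplification

# The component along v grows by (1 + alpha); orthogonal parts are unchanged.
u = v / np.linalg.norm(v)
print(np.allclose(u @ h_edited, 1.5 * (u @ h)))  # True
```

Because the edit is a fixed linear map composed with existing weights, a later fine-tuning run can leave the edit numerically near-intact in weight space while still learning activations that route around its effect, which is the dissociation the abstract reports.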
Presenter: ~Philipp_E._Glass1
Submission Number: 120