Keywords: activation steering, influence function, large language models
TL;DR: Activation steering and data influence are first‑order equivalent; we give constructive mappings, a feasibility test, and spectral directions for practical control and provenance.
Abstract: Activation steering adds a low‑dimensional vector to an intermediate layer of a neural network to elicit or suppress behaviors, whereas influence functions trace the effect of infinitesimally re‑weighting training examples on model outputs. We prove that, to first order, these techniques are equivalent: any steering vector can be represented as an influence weighting over training data, and vice versa. This duality yields: (i) a constructive algorithm for mapping undesired behaviors back to causal training examples; (ii) an optimal‑control perspective on steering that reveals its regularization properties; and (iii) generalization bounds for low‑rank steering interventions. Our analysis adds theoretical clarity to two popular but previously disconnected strands of interpretability research.
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Submission Number: 5327