A Unified First‑Order Framework for Activation Steering and Data Influence

Yan Leng

15 Sept 2025 (modified: 22 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: activation steering, influence function, large language models

TL;DR: Activation steering and data influence are first‑order equivalent; we give constructive mappings, a feasibility test, and spectral directions for practical control and provenance.

Abstract: Activation steering adds a low-dimensional vector to an intermediate layer of a neural network to elicit or suppress behaviors, whereas influence functions trace the effect of infinitesimally re-weighting training examples on model outputs. We prove that these two techniques are equivalent to first order: any steering vector can be represented as an influence weighting over training data, and vice versa. This duality yields (i) a constructive algorithm for mapping undesired behaviors back to causal training examples; (ii) an optimal-control perspective on steering that reveals its regularization properties; and (iii) generalization bounds for low-rank steering interventions. Our analysis adds theoretical clarity to two popular but previously disconnected strands of interpretability research.

Supplementary Material: pdf

Primary Area: interpretability and explainable AI

Submission Number: 5327
