STRIDE: Training Data Attribution Can Be Estimated In Activation Space

Published: 02 Mar 2026, Last Modified: 02 Apr 2026
Venue: Sci4DL 2026
Readers: Everyone
License: CC BY 4.0
Keywords: Training Data Attribution, Activation Space, Efficiency, Model Representations, Interpretability
TL;DR: STRIDE estimates training data influence directly in activation space via learned steering operators, outperforming gradient-based methods while being orders of magnitude faster.
Abstract: Understanding which training examples drive specific model behaviors is central to debugging failures, investigating safety issues, and auditing deployed systems. However, existing attribution methods operate in parameter space, where costs grow rapidly with model size. Approximations make these methods tractable for larger models but introduce overhead that limits low-latency deployment at scale. We introduce STRIDE, a scalable framework that estimates influence directly in activation space, bypassing explicit parameter interactions. STRIDE learns low-rank steering operators that approximate the effect of retraining on data subsets by shifting internal representations. We then recover per-example influence scores by solving a regularized regression problem that decomposes these subset-level shifts. Experiments show that STRIDE accurately identifies influential examples and detects data leakage, outperforming prior methods while running orders of magnitude faster.
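The regression step described in the abstract can be illustrated with a toy sketch. The abstract does not specify the regression form, so everything here is an assumption made for illustration: each subset-level shift is reduced to a single scalar, shifts are modeled as the sum of per-example scores over subset members, and the regularizer is a ridge (L2) penalty. The names `recover_influence`, `membership`, `subset_shift`, and `lam` are hypothetical and do not come from the paper.

```python
import numpy as np


def recover_influence(membership, subset_shift, lam=1e-2):
    """Toy ridge regression (assumed form, not the paper's exact objective).

    Solves  min_s ||membership @ s - subset_shift||^2 + lam * ||s||^2
    where membership[i, j] = 1 if example j belongs to subset i.
    """
    membership = np.asarray(membership, dtype=float)
    subset_shift = np.asarray(subset_shift, dtype=float)
    n_examples = membership.shape[1]
    # Closed-form ridge solution via the regularized normal equations.
    gram = membership.T @ membership + lam * np.eye(n_examples)
    return np.linalg.solve(gram, membership.T @ subset_shift)


# Synthetic check: random subsets, additive noise-free shifts (assumption).
rng = np.random.default_rng(0)
n_subsets, n_examples = 200, 20
membership = (rng.random((n_subsets, n_examples)) < 0.5).astype(float)
true_scores = rng.normal(size=n_examples)
subset_shift = membership @ true_scores
scores = recover_influence(membership, subset_shift, lam=1e-6)
```

In this synthetic setting the recovered `scores` closely match `true_scores` because the shifts are exactly additive and there are many more subsets than examples; real subset-level shifts would be noisy and vector-valued, which is where the regularization strength would matter.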
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 110