Influence Dynamics and Stagewise Data Attribution

ICLR 2026 Conference Submission20207 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Training data attribution, influence functions, singular learning theory, stagewise development, phase transitions, developmental interpretability, Bayesian influence functions

TL;DR: We demonstrate that neural network influence functions can change dramatically over training due to stagewise development, which challenges the static attribution paradigm and motivates a shift to stagewise data attribution.

Abstract: Current training data attribution (TDA) methods treat influence as static, ignoring the fact that neural networks learn in distinct stages. This stagewise development, driven by phase transitions on a degenerate loss landscape, means a sample's importance is not fixed but changes throughout training. In this work, we introduce a developmental framework for data attribution, grounded in singular learning theory. We predict that influence can change non-monotonically, including sign flips and sharp peaks at developmental transitions. We first confirm these predictions analytically and empirically in a toy model, showing that dynamic shifts in influence directly map to the model's progressive learning of a semantic hierarchy. Finally, we demonstrate these phenomena at scale in language models, where token-level influence changes align with known developmental stages.

Primary Area: interpretability and explainable AI

Submission Number: 20207
