Abstract: Understanding the heuristics and algorithms that make up a model's behavior is important for safe and reliable deployment.
While gradient clustering has been used for this purpose, gradients of a single log probability capture only a slice of the model's behavior, and clustering can assign only a single factor to each example.
We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that overcomes these limitations by decomposing per-example Fisher matrices using a novel decomposition algorithm that learns a set of components represented by learned rank-1 positive semi-definite matrices.
Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to heuristics used by language models on a variety of text processing tasks.
We find that NPEFF excels at decomposing behaviors composed of multiple factors compared to the baselines of gradient clustering and activation sparse autoencoders.
We also show how NPEFF can be adapted to be more efficient on tasks with few classes.
We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing.
Along with conducting extensive ablation studies, we include experiments using NPEFF to study in-context learning.
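As a rough illustration of the factorization the abstract describes, the sketch below treats each example's (diagonal) Fisher approximation as a non-negative vector and factorizes the stacked matrix with multiplicative non-negative matrix factorization updates. This is a minimal, hypothetical sketch of the general idea only, not the paper's algorithm: the variable names, the diagonal-Fisher simplification, the random stand-in data, and the plain NMF updates are all assumptions for illustration.

```python
import numpy as np

# Illustrative setup (all names and sizes are hypothetical): rows of F play
# the role of per-example diagonal Fisher approximations, which are
# entrywise non-negative, so a non-negative factorization applies.
rng = np.random.default_rng(0)
n_examples, n_params, n_components = 64, 128, 8

F = rng.random((n_examples, n_params))      # stand-in per-example Fisher rows

W = rng.random((n_examples, n_components))  # per-example component weights
H = rng.random((n_components, n_params))    # component parameter-importance rows

eps = 1e-9
for _ in range(200):
    # Standard multiplicative NMF updates; keep W and H non-negative.
    H *= (W.T @ F) / (W.T @ W @ H + eps)
    W *= (F @ H.T) / (W @ H @ H.T + eps)

recon_err = np.linalg.norm(F - W @ H) / np.linalg.norm(F)
```

Each row of `H` then stands in for one component (a parameter-importance pattern shared across examples), and each row of `W` gives how strongly the components contribute to a given example, so a single example can load on several components at once, unlike hard clustering.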
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Serguei_Barannikov1
Submission Number: 5532