Abstract: Understanding the heuristics and algorithms that constitute a model's behavior is important for safe and reliable deployment.
While gradient clustering has been used for this purpose, gradients of a single log probability capture only a slice of the model's behavior, and clustering can assign only a single factor to each behavior.
We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that overcomes these limitations by decomposing per-example Fisher matrices with a novel algorithm that learns a set of components, each represented as a rank-1 positive semi-definite matrix.
Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to heuristics used by language models on a variety of text processing tasks.
We find that NPEFF excels at decomposing behaviors composed of multiple factors compared to the baselines of gradient clustering and activation sparse autoencoders.
We also show how NPEFF can be adapted to be more efficient on tasks with few classes.
We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing.
Along with ablation studies, we include experiments using NPEFF to study in-context learning.
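For concreteness, here is a minimal sketch of the kind of factorization the abstract describes, with illustrative names and toy dimensions rather than the paper's actual implementation (the real method operates on low-rank PEF representations with specialized kernels; this toy version materializes the full matrices):

```python
import torch

# Illustrative NPEFF-style objective (a sketch, not the paper's code).
# Each per-example Fisher F_n is approximated as a non-negative combination
# of K learned rank-1 PSD components g_k g_k^T.

def npeff_loss(fishers, W, G):
    # fishers: (N, D, D) per-example Fisher matrices.
    # W: (N, K) non-negative per-example coefficients.
    # G: (K, D) component vectors; component k is the matrix G[k] G[k]^T.
    comps = torch.einsum('kd,ke->kde', G, G)       # (K, D, D) rank-1 PSD
    recon = torch.einsum('nk,kde->nde', W, comps)  # (N, D, D) reconstruction
    return ((fishers - recon) ** 2).sum()

# Toy data: rank-1 PEFs built from random per-example gradients.
torch.manual_seed(0)
N, D, K = 64, 16, 8
g = torch.randn(N, D)
fishers = torch.einsum('nd,ne->nde', g, g)

W = torch.rand(N, K, requires_grad=True)
G = torch.randn(K, D, requires_grad=True)
opt = torch.optim.Adam([W, G], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    npeff_loss(fishers, W, G).backward()
    opt.step()
    with torch.no_grad():
        W.clamp_(min=0)  # projected step keeps coefficients non-negative
```

Non-negativity is enforced here by a simple projected gradient step on the coefficients; the components themselves are unconstrained vectors, since a rank-1 outer product is automatically positive semi-definite.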
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=DUFvZXrQr7
Changes Since Last Submission: The sporadic failures in the perturbation experiments, which were caused by software bugs, have been fixed. The NPEFF failures resulted from random projections accidentally being applied twice in some cases, which created a mismatch with the random projection matrix used in compressed sensing. We recomputed the PEFs using only a single random projection for TriviaQA and CLINC150 and re-ran all downstream experiments; other results did not change significantly with these new PEFs. The failure of GC YAT was due to a bug in the transposed random projection matrix kernel. We re-ran all perturbation experiments, though only GC YAT showed a perceptible change.
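For illustration, a sketch of what applying the projection exactly once looks like (names and dimensions are hypothetical; the point is that one fixed matrix must be shared between compression and the later compressed-sensing recovery):

```python
import torch

# Hypothetical sketch: a single fixed random projection shared between
# compression of per-example gradients and later sparse recovery.
torch.manual_seed(0)
D, d = 10_000, 512                # full and projected dimensions (toy)
R = torch.randn(d, D) / d ** 0.5  # one fixed Gaussian projection matrix

def compress(grad_flat):
    # Project exactly once; re-projecting an already-compressed vector
    # (the bug described above) breaks the correspondence with R that
    # compressed-sensing recovery relies on.
    assert grad_flat.shape[-1] == D
    return grad_flat @ R.T
```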
To expand evaluation breadth, we added a human evaluation study for YAT in addition to TriviaQA, along with an LLM-based evaluation study on those tasks to support the human evaluation.
We reran the runtime comparisons using a vanilla PyTorch implementation of NPEFF, which we provide as a reference implementation in the Supplementary Material, and we now include GPU memory comparisons. We also added experiments with early stopping as a means of reducing the computational expense of NPEFF.
We provided clear inclusion criteria for baselines, namely methods that "unsupervisedly explain model behavior across examples by discovering a relatively small number of abstract factors." When introducing the baselines, we explain why input feature attribution, training data attribution, knowledge localization, and probing methods were not used as baselines.
We included a section in the ablations comparing approximating the expectation using random projections to approximating it using sampling.
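For reference, the expectation in question is the one defining the per-example Fisher; in a standard form (our notation, not necessarily the paper's):

$$F_n \;=\; \mathbb{E}_{y \sim p_\theta(\cdot \mid x_n)}\!\left[ \nabla_\theta \log p_\theta(y \mid x_n)\, \nabla_\theta \log p_\theta(y \mid x_n)^{\top} \right],$$

which can be computed exactly as a sum over classes when the label space is small, or approximated by sampling $y$ from the model's predictive distribution.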
Assigned Action Editor: ~Mengnan_Du1
Submission Number: 7283