Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Interpretability, factorization, Fisher information
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: As deep learning models are deployed in more and more settings, it becomes increasingly important to be able to understand why they produce a given prediction, but interpretation of these models remains a challenge. In this paper, we introduce a novel interpretability method called NPEFF that is readily applicable to any end-to-end differentiable model. It operates on the principle that processing of a characteristic shared across different examples involves a specific subset of model parameters. We perform NPEFF by decomposing each example's Fisher information matrix as a non-negative sum of components. These components take the form of either non-negative vectors or rank-1 positive semi-definite matrices, depending on whether we use diagonal or low-rank Fisher representations, respectively. For the latter form, we introduce a novel and highly scalable algorithm. We demonstrate that components recovered by NPEFF have interpretable tunings through experiments on language and vision models. Using unique properties of NPEFF's parameter-space representations, we run extensive experiments to verify that the connections between directions in parameter space and examples recovered by NPEFF actually reflect the model's processing. We further demonstrate NPEFF's ability to uncover processing strategies actually used by a model by creating a TRACR-compiled model with known ground truth. We explore a potential application of NPEFF in uncovering and correcting flawed heuristics used by a model. We release our code to facilitate research using NPEFF.
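The diagonal case described in the abstract can be illustrated with a small sketch: stack each example's diagonal Fisher as a non-negative row of a matrix and factor that matrix into non-negative coefficients and non-negative component vectors. The code below is an illustration only and is not the authors' released code. The toy matrix F, the two hand-written components, and the helper names (matmul, transpose, frob_error, nmf) are all hypothetical, and the factorization uses standard Lee-Seung multiplicative updates for squared error, which may differ from the procedure used in the paper.

```python
import random


def matmul(A, B):
    # Plain-Python matrix product of two matrices stored as lists of rows.
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(F, W, H):
    # Squared Frobenius norm of the reconstruction residual F - W H.
    WH = matmul(W, H)
    return sum((f - g) ** 2 for rf, rg in zip(F, WH) for f, g in zip(rf, rg))


def nmf(F, rank, n_iters=200, seed=0, eps=1e-12):
    """Factor a non-negative matrix F (examples x parameters) as W H with
    W, H >= 0, using Lee-Seung multiplicative updates for squared error."""
    rng = random.Random(seed)
    n, p = len(F), len(F[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(p)] for _ in range(rank)]
    for _ in range(n_iters):
        # H <- H * (W^T F) / (W^T W H), elementwise.
        Wt = transpose(W)
        num, den = matmul(Wt, F), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (F H^T) / (W H H^T), elementwise.
        Ht = transpose(H)
        num, den = matmul(F, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Hypothetical stand-in for per-example diagonal Fishers: each row is a
# non-negative mix of two made-up components over six parameters.
comp_a = [1.0, 2.0, 0.0, 0.0, 0.5, 0.0]
comp_b = [0.0, 0.0, 3.0, 1.0, 0.0, 0.2]
mixes = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 1.5), (1.0, 1.0), (0.5, 2.0)]
F = [[wa * a + wb * b for a, b in zip(comp_a, comp_b)] for wa, wb in mixes]

# Rows of H are the recovered components; rows of W are per-example coefficients.
W, H = nmf(F, rank=2)
print("reconstruction error:", frob_error(F, W, H))
```

In this sketch, each row of H plays the role of a component (a non-negative vector over parameters), and the entries of W indicate how strongly each example expresses each component. Examples with large coefficients for the same component would be the ones to inspect for a shared characteristic.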
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4277