Keywords: attribution, interpretability, model editing, science of deep learning, machine learning
TL;DR: A general framework for decomposing and editing predictions by explicitly modeling the internal computation graph of the model.
Abstract: How does the internal computation of a machine learning model transform inputs into predictions? To tackle this question, we introduce a framework called component modeling for decomposing a model prediction in terms of its components---architectural "building blocks" such as convolution filters or attention heads. We focus on a special case of this framework, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions, and demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that COAR directly enables effective model editing.
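To make the component-attribution idea concrete, below is a minimal sketch of a COAR-style estimator: ablate random subsets of components, record the model's output under each ablation, and fit a linear surrogate whose coefficients serve as per-component attribution scores. Everything here is illustrative, not the paper's implementation: `output_with_ablation`, the synthetic toy model it wraps, the 0.9 keep-probability, and the ridge regression are all assumed stand-ins, and the paper's actual ablation scheme and regression setup may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge

# --- Hypothetical stand-in for a real model --------------------------------
# In practice, `output_with_ablation` would run the actual network on a fixed
# example with the components marked 0 in `mask` ablated (e.g., zeroed out)
# and return a scalar output such as the correct-class margin. A synthetic
# function is used here purely so the sketch runs end to end.
num_components = 256
rng = np.random.default_rng(0)
true_effects = rng.normal(size=num_components)  # synthetic "ground truth"

def output_with_ablation(mask: np.ndarray) -> float:
    # Toy surrogate for the model's output under the ablation `mask`.
    return float(mask @ true_effects + 0.1 * np.tanh(mask.sum() / 50.0))

# --- COAR-style component attribution ---------------------------------------
# 1. Sample random ablation masks: each component is kept (1) or ablated (0).
n_samples = 2048
masks = (rng.random((n_samples, num_components)) < 0.9).astype(np.float64)

# 2. Record the model's output under each random ablation.
outputs = np.array([output_with_ablation(m) for m in masks])

# 3. Fit a linear surrogate: output ~ masks @ w + b. The learned coefficients
#    w estimate each component's counterfactual effect on this prediction.
surrogate = Ridge(alpha=1.0).fit(masks, outputs)
attributions = surrogate.coef_  # one attribution score per component

# Components with the largest-magnitude attributions are predicted to most
# change this prediction if ablated; these are natural targets for editing.
top = np.argsort(-np.abs(attributions))[:5]
print("Most influential components:", top)
```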
Email Of Author Nominated As Reviewer: harshay@mit.edu
Submission Number: 18