Decomposing and Editing Predictions by Modeling Model Computation

Harshay Shah; Andrew Ilyas; Aleksander Madry

Published: 16 Jun 2024, Last Modified: 16 Jun 2024

Venue: HiLD at ICML 2024 Poster

License: CC BY 4.0

Keywords: model components, linear surrogate models, model editing, science of deep learning, machine learning

TL;DR: A general framework for decomposing and editing predictions via surrogate modeling

Abstract: *How does the internal computation of a machine learning model transform inputs into predictions?* To tackle this question, we introduce a framework called *component modeling* for decomposing a model prediction in terms of its components: architectural "building blocks" such as convolution filters or attention heads. We focus on a special case of this framework, *component attribution*, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions, and demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that COAR directly enables effective model editing.
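
To make the idea of component attribution concrete, here is a minimal sketch of one way a linear surrogate could be fit to estimate the counterfactual impact of components on a single prediction. It is an illustration under stated assumptions, not the COAR algorithm, whose details the abstract does not give. The names `toy_model`, `estimate_attributions`, and `keep_prob`, the four-component toy model, and the choice of random ablation followed by ordinary least squares are all hypothetical.

```python
import numpy as np


def toy_model(x, weights, mask):
    """Hypothetical stand-in for a model with a few 'components'.

    mask[i] == 0 ablates component i. The small product term makes the
    output slightly non-additive, so a linear surrogate is only approximate.
    """
    contrib = weights * x * mask
    return contrib.sum() + 0.1 * contrib[0] * contrib[1]


def estimate_attributions(x, weights, n_samples=2000, keep_prob=0.9, seed=0):
    """Fit a linear surrogate from ablation masks to the model output.

    Assumption: components are ablated independently at random, and the
    surrogate is fit with ordinary least squares.
    """
    rng = np.random.default_rng(seed)
    n_components = weights.shape[0]
    # Each row is one random ablation pattern (1 = keep, 0 = ablate).
    masks = (rng.random((n_samples, n_components)) < keep_prob).astype(float)
    outputs = np.array([toy_model(x, weights, m) for m in masks])
    # Append a column of ones so the surrogate has a bias term.
    design = np.hstack([masks, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1], coef[-1]


weights = np.array([2.0, -1.0, 0.5, 0.0])
x = np.ones(4)
theta, bias = estimate_attributions(x, weights)
print("estimated per-component attributions:", np.round(theta, 2))
```

In this sketch, each entry of `theta` approximates how much the output changes when the corresponding component is kept rather than ablated, so a component with zero weight receives an attribution near zero.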

Student Paper: Yes

Submission Number: 20
