Automatic Discovery of Visual Circuits

NeurIPS 2023 Workshop ATTRIB Submission56 Authors

Published: 27 Oct 2023, Last Modified: 08 Dec 2023ATTRIB PosterEveryoneRevisionsBibTeX
Keywords: interpretability, circuits, computer vision
TL;DR: We introduce a new algorithm for automated circuit discovery in vision models and a new dataset of compositional concepts for evaluating circuit-discovery methods.
Abstract: To date, most discoveries of subnetworks that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model’s computational graph that underlies a particular capability. In this paper, we formulate capabilities as mappings of human-interpretable visual concepts to intermediate feature representations. We introduce a new method for identifying these subnetworks: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
Submission Number: 56