Activation Matching for Explanation Generation and Circuit Discovery

Published: 24 Nov 2025, Last Modified: 24 Nov 2025
5th Muslims in ML Workshop, co-located with NeurIPS 2025
License: CC BY 4.0
Keywords: Safety, Interpretability, Explainability, Circuits
TL;DR: In this paper, we propose an activation-matching-based approach for generating minimalist explanations.
Abstract: In this paper we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image, and to reveal the compact internal circuits that suffice for its decisions. Given an input image \(x\) and a frozen model \(f\), we train a lightweight autoencoder to output a binary mask \(m\) such that the explanation \(e = m \odot x\) preserves both the model's prediction and the intermediate activations of \(x\). Our objective combines: (i) multi-layer activation matching, with KL divergence to align the output distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors---an L1 area term for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Beyond producing per-image explanations, we also introduce a circuit-readout procedure: using the explanation's forward pass, we identify active channels and construct a channel-level graph, scoring inter-layer edges by ingress weight magnitude times source activation, and feature-to-class links by classifier weight magnitude times feature activation. This reveals sparse, data-dependent sub-circuits (internal pathways), providing a practical bridge between explainability in the input space and mechanistic circuit analysis.
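The composite objective described in the abstract can be sketched in PyTorch. This is a hedged illustration, not the authors' implementation: the weighting coefficients (`lam_*`), the use of mean-squared error for the multi-layer activation-matching term, and the function and argument names are all assumptions; the abstract's abductive constraints are omitted since their exact form is not specified.

```python
import torch
import torch.nn.functional as F

def explanation_loss(logits_x, logits_e, acts_x, acts_e, mask,
                     lam_kl=1.0, lam_ce=1.0, lam_area=0.01,
                     lam_bin=0.1, lam_tv=0.1):
    """Sketch of the objective for an explanation e = m * x.

    logits_x / logits_e: frozen-model logits for the image and the explanation.
    acts_x / acts_e: lists of intermediate activations from matched layers.
    mask: the autoencoder's soft mask m, values in [0, 1], shape (B, 1, H, W).
    All lam_* weights are illustrative, not taken from the paper.
    """
    # (i) multi-layer activation matching (MSE assumed as the metric)
    act_match = sum(F.mse_loss(a_e, a_x) for a_e, a_x in zip(acts_e, acts_x))
    # KL divergence aligning the explanation's class distribution with the image's
    kl = F.kl_div(F.log_softmax(logits_e, dim=-1),
                  F.softmax(logits_x, dim=-1), reduction="batchmean")
    # cross-entropy to retain the frozen model's top-1 label on the explanation
    ce = F.cross_entropy(logits_e, logits_x.argmax(dim=-1))
    # (ii) mask priors: L1 area (minimality), binarization (crisp 0/1 masks),
    # and total variation (compactness)
    area = mask.abs().mean()
    binar = (mask * (1.0 - mask)).mean()
    tv = (mask[..., 1:, :] - mask[..., :-1, :]).abs().mean() + \
         (mask[..., :, 1:] - mask[..., :, :-1]).abs().mean()
    return act_match + lam_kl * kl + lam_ce * ce + \
           lam_area * area + lam_bin * binar + lam_tv * tv

def edge_score(w_ingress, src_act):
    """Channel-level edge score from the circuit readout:
    ingress weight magnitude times source activation (elementwise sketch)."""
    return w_ingress.abs() * src_act
```

A usage note: the same `edge_score` form applies to feature-to-class links, with `w_ingress` taken from the classifier head's weights and `src_act` being the feature activation; thresholding these scores yields the sparse channel-level sub-circuit graph.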
Track: Track 2: ML by Muslim Authors
Submission Number: 57