From Feature Visualization to Visual Circuits: Effect of Model Perturbation

Published: 22 Feb 2026, Last Modified: 22 Feb 2026Accepted by TMLREveryoneRevisionsBibTeXCC BY 4.0
Abstract: Understanding the inner workings of large-scale deep neural networks is challenging yet crucial in several high-stakes applications. Mechanistic interpretability is an emergent field that tackles this challenge, often by identifying human-understandable subgraphs in deep neural networks known as circuits. In vision-pretrained models, these subgraphs are typically interpreted by visualizing their node features through a popular technique called feature visualization. Recent works have analyzed the stability of different feature visualization types under the adversarial model manipulation framework, where models are subtly perturbed to alter their interpretations while maintaining performance. However, existing model manipulation methods have two key limitations: (1) they manipulate either synthetic or natural feature visualizations individually, but not both simultaneously, and (2) no work has studied whether circuit-based interpretations are vulnerable to such manipulations. This paper exposes these vulnerabilities by proposing a novel attack called ProxPulse that simultaneously manipulates both types of feature visualizations. Surprisingly, we find that visual circuits exhibit some robustness to ProxPulse. We therefore introduce CircuitBreaker, the first attack targeting entire circuits, which successfully manipulates circuit interpretations, revealing that circuits also lack robustness. The effectiveness of these attacks is validated across a range of pre-trained models, from smaller architectures like AlexNet to medium-scale models like ResNet-50, and larger ones such as ResNet-152 and DenseNet-201 on ImageNet. ProxPulse changes both visualization types with <1\% accuracy drop, while our CircuitBreaker attack manipulates visual circuits with attribution correlation scores dropping from near-perfect to ~0.6 while preserving circuit head functionality.
Submission Length: Regular submission (no more than 12 pages of main content)
Supplementary Material: pdf
Assigned Action Editor: ~Satoshi_Hara1
Submission Number: 6243
Loading