SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

Published: 02 Mar 2026, Last Modified: 06 Mar 2026 · ICLR 2026 Trustworthy AI · CC BY 4.0
Keywords: Mechanistic Interpretability, Model Editing, Feature Visualization, Representation Learning
TL;DR: We introduce a framework that uses a sparse autoencoder for precise, permanent, and continuous weight editing in CNNs and ViTs, offering fine-grained control over model behavior.
Abstract: Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present **SALVE** (Sparse Autoencoder-Latent Vector Editing), a framework bridging mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder's structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_{\text{crit}}$, quantifying each class’s reliance on its dominant feature to support fine-grained robustness diagnostics. Our approach is validated on convolutional (ResNet-18) and transformer-based (ViT-B/16) models, demonstrating consistent, interpretable control. This work contributes a principled methodology for turning feature discovery into actionable model edits, advancing the development of transparent and controllable AI systems.
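The core mechanics described above can be sketched in a few lines: an $\ell_1$-regularized autoencoder learns a sparse latent code over a model's activations, and a feature is then continuously suppressed by rescaling its latent coordinate with a factor $\alpha$. This is a minimal illustrative sketch, not the paper's implementation; all shapes, names (`W_enc`, `W_dec`, `suppress_feature`), and the initialization are assumptions.

```python
import numpy as np

# Hypothetical dimensions: d = activation size, k = overcomplete latent size.
rng = np.random.default_rng(0)
d, k = 16, 32
W_enc = rng.normal(0.0, 0.1, (k, d))  # encoder weights (illustrative init)
W_dec = rng.normal(0.0, 0.1, (d, k))  # decoder weights

def encode(x):
    # ReLU latent code over the model's activation vector x.
    return np.maximum(W_enc @ x, 0.0)

def decode(z):
    return W_dec @ z

def sae_loss(x, lam=1e-3):
    # Reconstruction error plus an l1 sparsity penalty on the code.
    z = encode(x)
    x_hat = decode(z)
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(z))

def suppress_feature(z, j, alpha):
    # Continuous modulation of feature j:
    # alpha = 0 leaves the code unchanged; alpha = 1 removes the feature.
    z = z.copy()
    z[j] *= (1.0 - alpha)
    return z

# Suppress the dominant feature of a sample activation.
x = rng.normal(size=d)
z = encode(x)
j = int(np.argmax(z))
z_edit = suppress_feature(z, j, alpha=1.0)
```

In the paper's framing, $\alpha_{\text{crit}}$ would be the smallest `alpha` at which the class prediction flips, which quantifies how strongly that class depends on its dominant feature; the actual edit is applied permanently in weight space rather than per-sample as sketched here.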
Submission Number: 38