Faithful Attribution in Vision Transformers via Feature-Gradient Gating

CVPR 2026 Workshop HOW Proceedings Track, Submission 22

Published: 21 Mar 2026, Last Modified: 23 May 2026 · HOW 2026 · Everyone · CC BY 4.0

Include In Proceedings: Yes, include in CVPR proceedings

Public: Yes

Keywords: Vision Transformers, Attribution Methods, Faithfulness, Interpretability, Attention-Based Explanations, Sparse Autoencoders, Feature-Level Interpretability, Mechanistic Interpretability

TL;DR: We gate ViT attention attribution with Sparse Autoencoder features, improving faithfulness while decomposing relevance into interpretable semantic concepts.

Abstract: Attention-based attribution methods like TransMM identify where a Vision Transformer attends but not which internal features drive the prediction. Sparse Autoencoders (SAEs) can decompose ViT activations into interpretable feature dictionaries, yet their signals have not been integrated directly into attribution mechanisms. We propose feature-gradient gating: residual-stream gradients are projected onto SAE decoder directions, combined with feature activations to score patches, and used as multiplicative gates on TransMM's gradient-weighted attention before relevance propagation. The resulting per-patch scores decompose linearly into per-feature contributions, enabling inspection of which learned features drive each region's relevance. Across chest X-ray, endoscopy, and natural-image benchmarks, feature-gradient gating consistently improves Faithfulness Correlation and Salience-guided Faithfulness Coefficient (SaCo) over vanilla TransMM, with smaller or mixed gains on Pixel Flipping.
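The gating step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `feature_gradient_gate`, the array shapes, the normalization that maps scores to gates in roughly [0, 2], and the choice to scale the key-patch columns of a single gradient-weighted attention map are all assumptions made for illustration. The only parts taken from the abstract are the projection of residual-stream gradients onto SAE decoder directions, the combination with feature activations, the multiplicative use of the result, and the linear per-feature decomposition of the patch scores.

```python
import numpy as np

def feature_gradient_gate(acts, grads, decoder, grad_attn):
    """Hypothetical sketch of feature-gradient gating for one layer.

    acts:      (P, K) SAE feature activations per patch
    grads:     (P, D) gradient of the target logit w.r.t. the residual stream
    decoder:   (K, D) SAE decoder directions, one row per feature
    grad_attn: (P, P) gradient-weighted attention map (stand-in for TransMM's)
    """
    # Project each patch's residual-stream gradient onto every decoder direction.
    proj = grads @ decoder.T                 # (P, K)
    # Per-feature contribution: activation times projected gradient.
    contrib = acts * proj                    # (P, K)
    # Patch score is the sum of its per-feature contributions (linear decomposition).
    scores = contrib.sum(axis=1)             # (P,)
    # ASSUMPTION: rescale scores into multiplicative gates centred on 1.
    gates = 1.0 + scores / (np.abs(scores).max() + 1e-8)
    # ASSUMPTION: gate the key-patch columns before relevance propagation.
    gated_attn = grad_attn * gates[None, :]  # (P, P)
    return gated_attn, scores, contrib

# Toy usage with random inputs (shapes only; values carry no meaning).
rng = np.random.default_rng(0)
P, K, D = 4, 6, 8
gated_attn, scores, contrib = feature_gradient_gate(
    np.abs(rng.normal(size=(P, K))),   # non-negative activations, as after a ReLU
    rng.normal(size=(P, D)),
    rng.normal(size=(K, D)),
    np.abs(rng.normal(size=(P, P))),
)
top_feature_per_patch = contrib.argmax(axis=1)  # feature contributing most per patch
```

Because `scores` is a plain sum over the columns of `contrib`, each patch's score can be read off feature by feature, which is the kind of inspection the abstract refers to.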

Submission Number: 22