FocusViT: Faithful Explanations for Vision Transformers via Gradient-Guided Layer-Skipping
Abstract: Vision Transformers (ViTs) have emerged as powerful alternatives to CNNs for image recognition, yet their token-based, attention-driven architecture makes interpreting their predictions challenging. Existing explainability methods such as Grad-CAM and Attention Rollout either fail to capture hierarchical semantic information or assume that attention directly reflects importance, often producing misleading or diffuse explanations. We propose FocusViT, a novel explainability framework that integrates gradient-weighted attention attribution with dynamic, faithfulness-driven layer aggregation. By fusing attention maps with class-specific gradients and introducing per-head dynamic weighting, FocusViT highlights not only where the model attends but also how sensitive the prediction is to that attention. Furthermore, our adaptive layer-skipping strategy ensures that only semantically meaningful layers contribute to the final explanation, enhancing both faithfulness and clarity. Extensive quantitative and qualitative evaluations on diverse benchmarks demonstrate that FocusViT consistently outperforms existing methods in faithfulness, sparsity, robustness, and class sensitivity, providing sharper and more reliable visual explanations for ViTs.
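The abstract's core idea of fusing attention maps with class-specific gradients and then aggregating across layers can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration of the general recipe (gradient-weighted attention plus a rollout-style aggregation), not the authors' exact formulation; the function names, the head-averaging step, and the residual-connection handling are assumptions for demonstration.

```python
import numpy as np

def gradient_weighted_attention(attn, grad):
    """Fuse one layer's attention map with its class-specific gradient.

    attn, grad: arrays of shape (heads, tokens, tokens), where `grad` is
    the gradient of the target class score w.r.t. the attention weights.
    Hypothetical helper illustrating the abstract's idea; FocusViT itself
    uses per-head dynamic weighting rather than a plain head average.
    """
    # Element-wise fusion; ReLU keeps only positively contributing attention.
    fused = np.maximum(attn * grad, 0.0)
    # Simple average over heads (a stand-in for learned per-head weights).
    return fused.mean(axis=0)

def rollout(layer_maps):
    """Aggregate per-layer relevance maps in the style of Attention Rollout:
    add the identity to model residual connections, row-normalize, and
    matrix-multiply through the layers. FocusViT would instead skip layers
    judged unfaithful; here all given layers contribute.
    """
    n = layer_maps[0].shape[-1]
    joint = np.eye(n)
    for a in layer_maps:
        a = a + np.eye(n)                       # residual connection
        a = a / a.sum(axis=-1, keepdims=True)   # row-normalize
        joint = a @ joint
    return joint
```

In practice the attention weights and their gradients would be captured from a ViT with forward/backward hooks; the resulting joint map's row for the class token is typically reshaped into a patch-grid heatmap.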
Submission Number: 650