Abstract: Vision Transformers (ViTs) have recently gained significant traction in computer vision. When trained for image classification with category labels, ViTs naturally excel at efficiently extracting features that are strongly correlated with those labels. However, such features offer no guarantee of human-like justification, because category labels carry no connection to the human perspective. As a consequence, ViTs can achieve high accuracy regardless of whether these features are truly class-discriminative or merely coincidental. This can leave the model's decisions unjustified from a human point of view, a crucial property in high-risk scenarios. In this paper, we aim to address this limitation. Note that explainability or interpretability can help people understand a model's decisions, but it does not guarantee justification. We argue that achieving assured justification from a human perspective requires connecting feature extraction to the human perspective. As a first attempt, we adopt a simple yet effective approach that incorporates the human perspective into ViT feature extraction by leveraging additional guidance, while keeping that guidance minimal. We conduct a series of experiments and validate the proposed method with both quantitative and qualitative results, demonstrating that we can provide assured justification without compromising accuracy. As an added benefit, our method requires less data and lower training costs, yet remains robust to contextual variations while maintaining accuracy.
External IDs: dblp:conf/pakdd/JohnD25