Abstract: Recently, visual transformers have shown promising results in tasks such as image classification, segmentation, and object detection, yet the explanation of their decisions remains a challenge. This paper focuses on exploiting self-attention for explanation. We propose a generalized interpretation of transformers, i.e., model-agnostic but class-specific explanations. The main principle lies in the use and weighting of the self-attention maps of a visual transformer. To evaluate it, we use the popular hypothesis that an explanation is good if it correlates with human perception of a visual scene. Thus, the method has been evaluated against the Gaze Fixation Density Maps obtained in a psycho-visual experiment on a public database. It has been compared with other popular explainers such as Grad-CAM, LRP, Rollout, and Adaptive Relevance methods. The proposed method outperforms the best baseline by 2% on the standard Pearson Correlation Coefficient (PCC) metric.
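To make the evaluation pipeline concrete, here is a minimal, hypothetical sketch of the two ingredients the abstract mentions: combining weighted self-attention maps into a saliency map, and scoring it against a gaze map with the Pearson Correlation Coefficient. The function names, the uniform per-layer weighting, and the random stand-in data are illustrative assumptions, not the paper's actual class-specific weighting scheme.

```python
import numpy as np

def explanation_from_attention(attn_maps, layer_weights):
    """Combine per-layer self-attention maps into one saliency map.

    attn_maps:     array of shape (L, H, T, T) - L layers, H heads,
                   T tokens (CLS token + patch tokens).
    layer_weights: per-layer weights (a placeholder; the paper's actual
                   class-specific weighting is not reproduced here).
    Returns a (T-1,) saliency vector over patch tokens, taken from the
    CLS-token row of the head-averaged attention.
    """
    per_layer = attn_maps.mean(axis=1)           # average over heads -> (L, T, T)
    cls_to_patches = per_layer[:, 0, 1:]         # CLS attends to patches -> (L, T-1)
    w = np.asarray(layer_weights, dtype=float)
    w = w / w.sum()
    saliency = (w[:, None] * cls_to_patches).sum(axis=0)
    return saliency / (saliency.max() + 1e-8)    # normalize to [0, 1]

def pcc(a, b):
    """Pearson Correlation Coefficient between two flattened maps."""
    a, b = a.ravel(), b.ravel()
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

# Toy example: 12 layers, 12 heads, 197 tokens (ViT-B/16-like geometry).
rng = np.random.default_rng(0)
attn = rng.random((12, 12, 197, 197))
expl = explanation_from_attention(attn, layer_weights=np.ones(12))
gaze = rng.random(196)  # stand-in for a Gaze Fixation Density Map on the patch grid
print("PCC:", pcc(expl, gaze))
```

In practice, the gaze density map would be resized to the transformer's patch grid before computing the PCC, so that both maps are compared at the same resolution.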