Keywords: ViT interpretability, weakly-supervised semantic segmentation
TL;DR: We reproduced a recently published work on ViT interpretability and used its explainability cues as seeds to construct pseudo segmentation masks with AffinityNet.
Abstract: Scope of Reproducibility: In this work, we experimented with Layer-wise Relevance Propagation combined with back-propagation to perform classification and semantic segmentation, following the approach proposed for computer vision by Chefer et al. (1). Moreover, we incorporated the concept of pixel affinities, using ViT-based explainability maps as visual seeds to drive the generation of pseudo segmentation masks, following the approach described by Ahn et al. (2).

Methodology: To reproduce the experiments presented in (1) and (2), we first examined the authors' code thoroughly and, based on our understanding, replicated most parts of the pipeline. The exceptions were the evaluation metrics — the positive and negative perturbation area-under-curve (AUC) results for the predicted and target classes on the ImageNet (3) validation set, as well as segmentation performance on the ImageNet-segmentation (4) dataset — which we borrowed from the repository accompanying (1). Regarding hardware, we used private resources to train our Hybrid-ViT architecture and affinity network, as well as to run inference for all our models. Reproducing the vision-related results of (1) took roughly 15 GPU hours, whereas training and evaluating AffinityNet on the Hybrid-ViT architecture took about 40 GPU hours.

Results: Overall, we reproduced the vision-related experiments of (1). Our results match those reported in (1) to the first decimal place, supporting the authors' claim of an effective ViT interpretability method. As for AffinityNet (2), we adapted the method to Hybrid-ViT architectures, with our experiments indicating that the weakly-supervised semantic segmentation performance of Hybrid-ViT architectures is inferior to that of CNN-based ones.
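The relevance aggregation of (1) — each layer contributes the identity plus the head-averaged positive part of gradient-times-attention, chained across layers in attention-rollout style — can be sketched roughly as below. This is a simplified NumPy illustration that uses the raw attention maps in place of the LRP-propagated relevance of the attention, so it approximates rather than reproduces the authors' exact implementation.

```python
import numpy as np

def aggregate_relevance(attn, attn_grad):
    """Simplified sketch of Chefer et al.-style relevance aggregation.

    attn, attn_grad: lists of per-layer attention maps and their gradients,
    each of shape (heads, tokens, tokens). Each layer contributes
    I + E_h[(grad * attn)^+] (using raw attention as a stand-in for the
    LRP relevance of the attention map), and the per-layer contributions
    are chained by matrix multiplication, as in attention rollout.
    """
    n = attn[0].shape[-1]
    rollout = np.eye(n)
    for A, G in zip(attn, attn_grad):
        cam = np.clip(A * G, 0, None).mean(axis=0)  # E_h[(grad ⊙ attn)^+]
        rollout = (np.eye(n) + cam) @ rollout
    # relevance of the patch tokens w.r.t. the [CLS] token (row 0)
    return rollout[0, 1:]
```

The returned vector can be reshaped to the patch grid and upsampled to image resolution to obtain the visual seeds used for pseudo-mask generation.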
What was easy: We found it particularly easy to run and understand the code provided by the original authors of both papers (1) and (2). When it comes to replicating (1), the authors provided most of the information required to reproduce the vision-related experiments, with the code compensating for what was missing.

What was difficult: The main difficulty in replicating the study presented in (1) was that the paper does not detail how to compute the perturbation AUC metric.

(1) Chefer, Hila, Shir Gur, and Lior Wolf. "Transformer interpretability beyond attention visualization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
(2) Ahn, Jiwoon, and Suha Kwak. "Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
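Since the perturbation-AUC computation is left unspecified in (1), the following is a minimal sketch of how such a metric is commonly computed — our assumption, not necessarily the authors' exact procedure: remove an increasing fraction of pixels in relevance order, record model accuracy at each step, and integrate the resulting accuracy-vs-fraction curve.

```python
import numpy as np

def perturbation_auc(accuracy_fn, relevance, steps=10, positive=True):
    """Sketch of a positive/negative perturbation AUC (assumed procedure).

    accuracy_fn(mask) -> model accuracy in [0, 1] when `mask` (same shape
    as `relevance`) marks the pixels that have been removed/blacked out.
    positive=True removes the MOST relevant pixels first, so a faithful
    explanation yields a LOW AUC; positive=False removes the LEAST
    relevant first, so a faithful explanation yields a HIGH AUC.
    """
    order = np.argsort(relevance.ravel())
    if positive:
        order = order[::-1]                       # most relevant first
    n = relevance.size
    fractions = np.linspace(0.0, 0.9, steps)      # fraction of pixels removed
    accs = []
    for f in fractions:
        mask = np.zeros(n, dtype=bool)
        mask[order[: int(f * n)]] = True          # pixels removed at this step
        accs.append(accuracy_fn(mask.reshape(relevance.shape)))
    accs = np.asarray(accs)
    # trapezoidal area under the accuracy curve, normalised to [0, 1]
    area = np.sum((accs[:-1] + accs[1:]) / 2 * np.diff(fractions))
    return float(area / (fractions[-1] - fractions[0]))
```

On a faithful relevance map, the positive-perturbation AUC should come out markedly lower than the negative one.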
Paper Url: https://openaccess.thecvf.com/content/CVPR2021/papers/Chefer_Transformer_Interpretability_Beyond_Attention_Visualization_CVPR_2021_paper.pdf
Paper Venue: CVPR 2021
Supplementary Material: zip