Neural Tree Decoder for Interpretation of Vision Transformers

Published: 01 Jan 2024, Last Modified: 15 Sept 2025 · IEEE Trans. Artif. Intell. 2024 · CC BY-SA 4.0
Abstract: In this study, we propose a novel vision transformer neural tree decoder (ViT-NeT) that is interpretable and highly accurate for fine-grained visual categorization (FGVC). A ViT acts as the backbone, and to overcome its limitations, the output contextual image patches are fed to the proposed NeT. NeT aims to more accurately classify fine-grained objects, which exhibit high inter-class similarity and large intra-class variation. ViT-NeT can also describe its decision-making process and visually interpret the results through tree structures and prototypes. Because the proposed ViT-NeT is designed not only to improve FGVC classification performance but also to provide human-friendly interpretation, it is effective in resolving the tradeoff between performance and interpretability. We compared the performance of ViT-NeT with other state-of-the-art (SoTA) methods on the widely used FGVC benchmark datasets CUB-200-2011, Stanford Dogs, Stanford Cars, NABirds, and iNaturalist. The proposed method shows promising quantitative and qualitative performance compared with previous SoTA methods, as well as excellent interpretability.
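To make the described architecture concrete, below is a minimal sketch, not the authors' released implementation, of the overall wiring the abstract describes: a ViT backbone produces pooled patch-token features, and a small soft decision tree routes them through internal nodes whose routing is driven by learned prototypes, yielding both a class prediction and per-node routing probabilities that can be inspected for interpretation. The class names (SoftTreeDecoder, ViTNeTSketch), the tree depth, the use of timm's vit_small_patch16_224, and all hyperparameters are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of a ViT backbone + prototype-routed soft decision tree.
# Assumes: torch and timm are installed; depth, model variant, and pooling
# are illustrative choices, not the paper's actual configuration.
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftTreeDecoder(nn.Module):
    """Balanced binary soft decision tree with prototype-based routing."""

    def __init__(self, embed_dim: int, depth: int, num_classes: int):
        super().__init__()
        self.depth = depth
        num_internal = 2 ** depth - 1          # internal (routing) nodes
        num_leaves = 2 ** depth                # leaf (prediction) nodes
        # One prototype vector per internal node; similarity to it drives routing.
        self.prototypes = nn.Parameter(torch.randn(num_internal, embed_dim) * 0.02)
        # Each leaf holds a distribution over classes (stored as logits).
        self.leaf_logits = nn.Parameter(torch.zeros(num_leaves, num_classes))

    def forward(self, feats: torch.Tensor):
        # feats: (B, embed_dim) pooled patch-token features from the backbone.
        # Routing probability at each internal node = sigmoid of cosine similarity
        # between the feature and that node's prototype.
        sim = F.cosine_similarity(
            feats.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1
        )                                       # (B, num_internal)
        p_right = torch.sigmoid(sim)

        # Accumulate path probabilities level by level down the balanced tree.
        path = torch.ones(feats.size(0), 1, device=feats.device)
        node = 0
        for level in range(self.depth):
            width = 2 ** level
            p = p_right[:, node:node + width]   # routing probs at this level
            path = torch.stack([path * (1 - p), path * p], dim=-1).flatten(1)
            node += width
        # path: (B, num_leaves) - probability of reaching each leaf (inspectable).

        leaf_probs = F.softmax(self.leaf_logits, dim=-1)    # (num_leaves, C)
        class_probs = path @ leaf_probs                      # (B, C)
        return class_probs, path


class ViTNeTSketch(nn.Module):
    def __init__(self, num_classes: int, depth: int = 3):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.backbone = timm.create_model(
            "vit_small_patch16_224", pretrained=False, num_classes=0
        )
        self.decoder = SoftTreeDecoder(self.backbone.num_features, depth, num_classes)

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)           # (B, embed_dim) pooled features
        return self.decoder(feats)


if __name__ == "__main__":
    model = ViTNeTSketch(num_classes=200)       # e.g. CUB-200-2011 has 200 classes
    probs, paths = model(torch.randn(2, 3, 224, 224))
    print(probs.shape, paths.shape)             # (2, 200) and (2, 8)
```

The returned `paths` tensor is what makes this kind of decoder inspectable: for a given image, it shows how probability mass flows to each leaf, and the internal-node prototypes can be visualized to explain each routing decision.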