Prototypical Part Transformer for Interpretable Image Recognition

Published: 01 Jan 2025 · Last Modified: 31 Jul 2025 · ICASSP 2025 · License: CC BY-SA 4.0
Abstract: Prototypical Part Networks facilitate interpretable decision-making, computing the classification score by comparing test image patches to learned prototypes. Existing work typically learns prototypes from Convolutional Neural Networks (CNNs); learning prototypical parts directly in Vision Transformers (ViTs), however, results in fragmented and noisy prototype activations. To address this, we quantify the dispersion of prototypes' responsive regions with the Diffusion Index (DI). We then propose the Prototypical Part Transformer (PPTformer), an interpretable model designed to refine prototype learning in ViTs by introducing distinct prototypical branches, with or without the CLS token. In PPTformer, the prototype space is defined by orthogonal class-aware prototype vectors, ensuring disentanglement and informativeness. Additionally, class-aware activation refinement is introduced to focus attention and reduce DI. Extensive experiments demonstrate that PPTformer outperforms state-of-the-art prototypical learning methods and its non-interpretable counterparts, providing faithful local and global explanations.
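The abstract's core scoring mechanism — comparing each test image patch to learned, class-assigned prototypes and aggregating the best matches into class logits — can be sketched as follows. This is a minimal illustration of generic prototypical-part classification, not the paper's actual PPTformer implementation; all names (`prototype_logits`, `proto_class`, etc.) and the choice of cosine similarity with max-pooling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prototype_logits(patches, prototypes, proto_class):
    """Hypothetical prototypical-part scoring (illustrative only).

    patches:     (P, D) patch embeddings of one test image
    prototypes:  (K, D) learned prototype vectors
    proto_class: (K,)   class id each prototype belongs to
    """
    # Cosine similarity between every patch and every prototype: (P, K)
    sims = l2_normalize(patches) @ l2_normalize(prototypes).T
    # Each prototype's evidence is its best match over all patches: (K,)
    evidence = sims.max(axis=0)
    # Sum prototype evidence per class to form class logits
    n_classes = int(proto_class.max()) + 1
    logits = np.zeros(n_classes)
    for k, c in enumerate(proto_class):
        logits[c] += evidence[k]
    return logits

# Toy example: a 7x7 grid of 8-dim patch tokens, 3 prototypes per class, 2 classes
patches = rng.standard_normal((49, 8))
prototypes = rng.standard_normal((6, 8))
proto_class = np.array([0, 0, 0, 1, 1, 1])
print(prototype_logits(patches, prototypes, proto_class))
```

Because each logit sums a fixed number of bounded cosine similarities, the per-prototype evidence also tells the user *which* patch most activated each prototype — the basis for the local explanations the abstract refers to.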