A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis

Dipanjyoti Paul; Arpita Chowdhury; Xinqi Xiong; Feng-Ju Chang; David Edward Carlyn; Samuel Stevens; Kaiya L Provost; Anuj Karpatne; Bryan Carstens; Daniel Rubenstein; Charles Stewart; Tanya Berger-Wolf; Yu Su; Wei-Lun Chao

Published: 16 Jan 2024, Last Modified: 24 Apr 2024, Venue: ICLR 2024 poster

Keywords: Explainability, Interpretability, Transformer, Fine-grained recognition, Attribute discovery

TL;DR: Transformer-based interpretable image recognition in which each decoder query learns class-specific features.

Abstract: We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully connected layer to incorporate class information to make predictions, we investigate a proactive approach, asking each class to search for itself in an image. We realize this idea via a Transformer encoder-decoder inspired by DEtection TRansformer (DETR). We learn "class-specific" queries (one for each class) as input to the decoder, enabling each class to localize its patterns in an image via cross-attention. We name our approach INterpretable TRansformer (INTR), which is fairly easy to implement and exhibits several compelling properties. We show that INTR intrinsically encourages each class to attend distinctively; the cross-attention weights thus provide a faithful interpretation of the prediction. Interestingly, via "multi-head" cross-attention, INTR could identify different "attributes" of a class, making it particularly suitable for fine-grained classification and analysis, which we demonstrate on eight datasets. Our code and pre-trained models are publicly accessible at the Imageomics Institute GitHub site: https://github.com/Imageomics/INTR.
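
To make the idea of class-specific queries concrete, here is a minimal, dependency-free sketch of one cross-attention step with one query per class. It is an illustration only, not the authors' implementation: it uses a single head, random vectors in place of learned queries and encoder features, and no key/value projections. The function name `class_query_attention` and the shared scoring vector `shared_w` are assumptions of this sketch and are not described in the abstract.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_query_attention(class_queries, patch_features, shared_w):
    """Single-head cross-attention with one query per class.

    class_queries:  C vectors of length d (learned in a real model; random here)
    patch_features: N vectors of length d (stand-ins for encoder outputs)
    shared_w:       length-d vector that scores each class's attended feature
                    (hypothetical; an assumption of this sketch)
    Returns (logits, attention), where attention[c][n] is the weight that
    class c places on patch n.
    """
    d = len(shared_w)
    scale = 1.0 / math.sqrt(d)
    logits, attention = [], []
    for q in class_queries:
        # Each class query scores every patch, then normalizes the scores.
        weights = softmax([dot(q, k) * scale for k in patch_features])
        # Weighted sum of patch features: what this class "found" in the image.
        attended = [
            sum(w * p[i] for w, p in zip(weights, patch_features))
            for i in range(d)
        ]
        attention.append(weights)
        logits.append(dot(shared_w, attended))
    return logits, attention


rng = random.Random(0)
num_classes, num_patches, dim = 3, 4, 8
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_classes)]
patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
w = [rng.gauss(0, 1) for _ in range(dim)]

logits, attention = class_query_attention(queries, patches, w)
predicted = max(range(num_classes), key=lambda c: logits[c])
```

In this sketch, `attention[predicted]` plays the role of the interpretation: it shows which patches the predicted class attended to. A multi-head variant would produce one such weight vector per head, which is where the abstract's per-class "attributes" would come from.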

Primary Area: visualization or interpretation of learned representations

Submission Number: 7979
