LICO: Language-Image COnsistent Learning by Manifold and Distribution Alignments
Abstract: Interpreting the decisions of deep learning models has been an active topic since the explosion of deep neural networks. One convincing approach is attention-map-based visual interpretation, such as CAM, where the generation of attention maps depends merely on the categorical labels used for the cross-entropy loss. Although current interpretation methods can provide convincing decision clues, they do not consider the richer information with which humans describe an image, yielding only partial correspondence between images and attention maps. In this paper, we address this issue by correlating learnable language prompts with corresponding visual features through manifold learning and optimal transport (OT) theory. Specifically, we first minimize the KL-divergence between the adjacency matrices of vision and text features to guarantee a consistent global manifold structure. Second, we apply OT to assign local feature maps to the corresponding class-specific prompts, thereby generating fine-grained attention maps. Extensive experiments on eight datasets show that the proposed LICO helps generate more explainable attention maps when combined with current interpretation methods such as Grad-CAM. In addition, LICO also enables vanilla convolutional neural networks to achieve higher classification performance without introducing any computational overhead during inference.
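The two alignment objectives described above can be illustrated with a minimal NumPy sketch. It assumes cosine-similarity adjacency matrices with row-softmax normalization for the manifold term, and uniform marginals with entropic (Sinkhorn) iterations for the OT assignment; the function names and these specific choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def row_softmax(x):
    # Numerically stable softmax over each row.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def manifold_alignment_kl(img_feats, txt_feats):
    """KL-divergence between row-normalized adjacency (similarity)
    matrices of image and text features (global manifold consistency).
    Both inputs: (batch, dim) arrays."""
    def adjacency(f):
        # Cosine similarities within the batch, softmax-normalized per row
        # so each row is a distribution over neighbors (assumed choice).
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        return row_softmax(f @ f.T)
    p = adjacency(txt_feats)   # target distribution (text manifold)
    q = adjacency(img_feats)   # model distribution (image manifold)
    return float(np.sum(p * np.log(p / q)) / p.shape[0])

def sinkhorn_plan(cost, eps=0.1, iters=100):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations;
    cost: (n_local_features, n_prompt_tokens) distance matrix."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m  # uniform marginals (assumption)
    K = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan, sums to 1
```

The resulting transport plan assigns each local feature map a soft correspondence with the class-specific prompt tokens, which can then be used to weight fine-grained attention.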