Align2Concept: Language Guided Interpretable Image Recognition by Visual Prototype and Textual Concept Alignment
Abstract: Most work on interpretable neural networks strives to learn semantic concepts merely from single-modal information such as images. However, humans usually learn semantic concepts from multiple modalities, and the brain encodes semantics from fused multi-modal information. Inspired by cognitive science and vision-language learning, we propose a Prototype-Concept Alignment Network (ProCoNet) for learning visual prototypes under the guidance of textual concepts. In ProCoNet, we design a visual encoder that decomposes the input image into regional prototype features, and we develop a prompt generation strategy that uses in-context learning to prompt large language models to generate textual concepts. To align visual prototypes with textual concepts, we leverage the multimodal space provided by pre-trained CLIP as a bridge. Specifically, the regional features from the vision space and the CLIP-encoded cropped prototype regions reside on different but semantically highly correlated manifolds, i.e., they follow a multi-manifold distribution. We transform the multi-manifold distribution alignment problem into optimizing a projection matrix via the Cayley transform on the Stiefel manifold. Through the learned projection matrix, visual prototypes can be projected into the multimodal space to align with semantically similar textual concept features encoded by CLIP. We conducted two case studies on the CUB-200-2011 and Oxford Flowers datasets. Our experiments show that ProCoNet provides higher accuracy and better interpretability than single-modality interpretable models. Furthermore, ProCoNet offers a form of interpretability not available in previous interpretable methods.
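The abstract describes optimizing a projection matrix on the Stiefel manifold via the Cayley transform so that projected visual prototypes align with CLIP text features. Below is a minimal sketch of that idea, not the authors' implementation: the feature dimensions, learning rate, and cosine-similarity loss are illustrative assumptions, but the Cayley retraction itself is the standard construction that keeps an orthonormal matrix on the manifold.

```python
import torch

def cayley_step(W: torch.Tensor, grad: torch.Tensor, lr: float = 1e-2) -> torch.Tensor:
    """One retraction step: build a skew-symmetric matrix from the Euclidean
    gradient and apply the Cayley transform so W stays on the Stiefel manifold."""
    # Skew-symmetric term A = G W^T - W G^T
    A = grad @ W.T - W @ grad.T
    I = torch.eye(A.shape[0], device=W.device, dtype=W.dtype)
    # Cayley transform: W_new = (I + lr/2 * A)^{-1} (I - lr/2 * A) W
    return torch.linalg.solve(I + 0.5 * lr * A, (I - 0.5 * lr * A) @ W)

# Toy usage (all tensors are random stand-ins): project prototype features into a
# shared CLIP-like space and pull them toward matching concept text features.
dim_vis, dim_clip = 512, 512                               # assumed dimensions
W = torch.linalg.qr(torch.randn(dim_vis, dim_clip))[0]     # start with W^T W = I
proto = torch.randn(32, dim_vis)                           # regional prototype features
text = torch.randn(32, dim_clip)                           # CLIP-encoded concept features

W.requires_grad_(True)
loss = 1 - torch.nn.functional.cosine_similarity(proto @ W, text).mean()
loss.backward()
with torch.no_grad():
    W = cayley_step(W, W.grad)

# Orthonormality is preserved after the update.
print(torch.allclose(W.T @ W, torch.eye(dim_clip), atol=1e-4))
```

Because the Cayley factor is orthogonal whenever A is skew-symmetric, the update never leaves the Stiefel manifold, so no explicit re-orthogonalization of W is needed between steps.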