Align2Concept: Language Guided Interpretable Image Recognition by Visual Prototype and Textual Concept Alignment

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Most work on interpretable neural networks strives to learn semantic concepts merely from single-modal information, such as images. Humans, however, usually learn semantic concepts from multiple modalities, and the brain encodes semantics from fused multi-modal information. Inspired by cognitive science and vision-language learning, we propose a Prototype-Concept Alignment Network (ProCoNet) that learns visual prototypes under the guidance of textual concepts. In ProCoNet, we design a visual encoder that decomposes the input image into regional prototype features, along with a prompt generation strategy that uses in-context learning to prompt large language models to generate textual concepts. To align visual prototypes with textual concepts, we leverage the multimodal space provided by the pre-trained CLIP as a bridge. Specifically, the regional features from the vision space and the CLIP-encoded cropped prototype regions reside on different but semantically highly correlated manifolds, i.e., they follow a multi-manifold distribution. We transform this multi-manifold distribution alignment problem into optimizing a projection matrix via the Cayley transform on the Stiefel manifold. Through the learned projection matrix, visual prototypes can be projected into the multimodal space and aligned with semantically similar textual concept features encoded by CLIP. We conducted two case studies on the CUB-200-2011 and Oxford Flower datasets. Our experiments show that ProCoNet provides higher accuracy and better interpretability than single-modality interpretable models, and it offers a form of interpretability not previously available in other interpretable methods.
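To make the alignment step concrete, the sketch below shows one way the Stiefel-manifold optimization described in the abstract could look in PyTorch: an orthonormal projection matrix maps prototype features into the CLIP multimodal space, and a Cayley-transform retraction keeps the matrix orthonormal after each gradient step. This is a minimal sketch under stated assumptions, not the authors' implementation; the feature dimensions, loss, step size, and all variable names are illustrative.

```python
# A minimal sketch, assuming PyTorch; not the authors' implementation.
# An orthonormal matrix W (W^T W = I) projects visual prototype features into the
# CLIP multimodal space, where an alignment loss pulls them toward their textual
# concept embeddings. Each step descends the loss and retracts W back onto the
# Stiefel manifold with a Cayley transform, so orthonormality is preserved exactly.
import torch


def cayley_retraction(W: torch.Tensor, grad: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """One descent step on the Stiefel manifold {W : W^T W = I}.

    A = G W^T - W G^T is skew-symmetric, so Q = (I + tau/2 A)^{-1} (I - tau/2 A)
    is orthogonal and Q W stays on the manifold (Wen & Yin-style update).
    """
    n = W.shape[0]
    A = grad @ W.T - W @ grad.T
    I = torch.eye(n, dtype=W.dtype, device=W.device)
    Q = torch.linalg.solve(I + 0.5 * tau * A, I - 0.5 * tau * A)
    return Q @ W


def alignment_loss(proto_feats: torch.Tensor, concept_feats: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Cosine-style alignment between projected prototypes and concept features."""
    projected = proto_feats @ W                                   # map prototypes into the CLIP space
    projected = projected / projected.norm(dim=-1, keepdim=True)
    concepts = concept_feats / concept_feats.norm(dim=-1, keepdim=True)
    return (1.0 - (projected * concepts).sum(dim=-1)).mean()


# Illustrative shapes: 768-d prototype features, 512-d CLIP text embeddings.
torch.manual_seed(0)
W = torch.linalg.qr(torch.randn(768, 512)).Q                      # orthonormal init: W^T W = I
proto_feats = torch.randn(32, 768)                                # regional prototype features
concept_feats = torch.randn(32, 512)                              # CLIP-encoded textual concepts

for step in range(20):
    W.requires_grad_(True)
    loss = alignment_loss(proto_feats, concept_feats, W)
    (grad,) = torch.autograd.grad(loss, W)
    with torch.no_grad():
        W = cayley_retraction(W.detach(), grad)
```

The retraction above uses the full n-by-n Cayley form for clarity; for large feature dimensions, the low-rank Sherman-Morrison-Woodbury variant of the same update is typically preferred for efficiency.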
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work contributes to multimodal processing by integrating visual and textual modalities through the Prototype-Concept Alignment Network (ProCoNet). ProCoNet learns visual prototypes under the guidance of textual descriptions, bridging cognitive science and practical multimodal learning. It couples a visual encoder that decomposes images into regional prototype features with a prompt generation strategy that leverages in-context learning for large language models, and it uses the multimodal space provided by CLIP to align semantically rich visual features with textual concepts in a unified framework, casting the alignment challenge as an optimization problem on the Stiefel manifold. This yields a deeper, more interpretable fusion of multimodal data. The demonstrated improvements in accuracy and interpretability over single-modality models on the CUB-200-2011 and Oxford Flower datasets underscore ProCoNet's potential for more nuanced and effective multimedia processing in both academic research and practical applications. The alignment capability also introduces a level of interpretability that is important for applications requiring robust and explainable multimodal integration.
Supplementary Material: zip
Submission Number: 1601