Multi-Attentional Distance for Zero-Shot Classification with Text-to-Image Diffusion Model

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · ICME 2024 · CC BY-SA 4.0
Abstract: Text-to-image diffusion models have demonstrated rich visual-linguistic capability. However, existing image classification methods based on diffusion models simply select the best-predicted noise and do not adequately exploit the relationships between visual elements and text. To this end, we propose a novel Multi-attentional Distance Classifier (MDC) that exploits additional information available in diffusion models. Specifically, MDC combines self- and cross-attention maps to model the structural and semantic distances of an image's latent variables under different category conditions, thereby measuring the relevance between images and categories. By integrating the two types of distances, we classify an image as the category with the minimum distance. We evaluate MDC on the CIFAR-10, STL-10, and CIFAR-100 datasets under the zero-shot setting, and it achieves superior performance to prior works. Further experiments show that, by introducing attention into the diffusion process, MDC can discover key semantic and structural information of categories in images. Code is publicly available at https://github.com/Carlofkl/MDC.
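The decision rule described above (fuse semantic and structural distances, then pick the minimum-distance category) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-category distance vectors, the weighted-sum fusion, and the `alpha` parameter are all assumptions, since the abstract does not specify how the two distances are combined.

```python
import numpy as np

def mdc_classify(semantic_dist, structural_dist, alpha=0.5):
    """Fuse two per-category distance vectors and return the index of the
    minimum-distance category.

    semantic_dist / structural_dist: hypothetical distances derived from
    cross- and self-attention maps, respectively (one entry per category).
    alpha: assumed fusion weight; the paper's exact fusion rule is not
    given in the abstract.
    """
    semantic = np.asarray(semantic_dist, dtype=float)
    structural = np.asarray(structural_dist, dtype=float)
    fused = alpha * semantic + (1.0 - alpha) * structural
    # Classify as the category whose fused distance is smallest.
    return int(np.argmin(fused))

# Example: 3 candidate categories; category 1 has the smallest fused distance.
print(mdc_classify([0.9, 0.2, 0.7], [0.8, 0.3, 0.6]))  # -> 1
```

In practice the distance vectors would be computed by running the diffusion model once per candidate text condition and comparing attention maps, so inference cost grows linearly with the number of categories.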