Highlights

• We propose an embedding-based attentional model for zero-shot image recognition.
• Our method combines multi-scale visual attention (VA) and attribute selection (AS).
• The two components, VA and AS, are optimized in a unified supervised framework.
• Our method discovers relationships between the visual and semantic spaces.
• Experimental results show that our method outperforms related methods.

Abstract

Observing that discriminative visual features and unambiguous attribute descriptions are both important in zero-shot learning (ZSL), we propose Multi-scale Visual Attention for Attribute Disambiguation (MVAAD). MVAAD contains a Multi-Scale Visual Attention Network (MSVAN) that attends to image regions, helping MVAAD learn more discriminative visual features. Building on the multi-scale visual features from MSVAN, we also develop a Coarse-to-fine Visual-guided Attribute Selection Module (CVASM) that exploits the multi-scale attentive visual features for attribute disambiguation. MSVAN and CVASM are jointly trained end-to-end by minimizing a visual-semantic classification loss and a latent-visual contrastive triplet loss. Experimental results on four popular ZSL benchmarks (AwA2, CUB, SUN, and FLO) show that MVAAD not only achieves state-of-the-art performance but also produces meaningful, interpretable visualizations of the visual attention and attribute selection.
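The abstract mentions training with a contrastive triplet loss between latent and visual embeddings. The paper's exact formulation is not given here; the following is a minimal sketch of a standard margin-based triplet loss with squared Euclidean distance, where the `margin` value and the distance choice are assumptions, not the authors' specification.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style contrastive triplet loss (illustrative only).

    Pulls the anchor embedding toward the positive embedding and
    pushes it away from the negative one by at least `margin`.
    The margin value 0.2 is an assumed hyperparameter.
    """
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(0.0, float(d_pos - d_neg + margin))

# Example: a well-separated triplet incurs zero loss,
# while a swapped triplet is penalized.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([1.0, 1.0])
easy = triplet_loss(a, p, n)   # 0.0: negative is already far enough
hard = triplet_loss(a, n, p)   # positive farther than negative, so loss > 0
```

In the joint objective described above, a term like this would be summed with the visual-semantic classification loss and minimized end-to-end.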