Multimodal Semantic Fusion for Zero-Shot Learning
Abstract: Zero-shot learning (ZSL) aims to predict classes that never appear in the training stage. The key to addressing these unseen classes is to transfer knowledge from seen classes by utilizing semantic information. However, existing semantic information suffers from incomplete semantics, which leads to a severe domain shift problem between real and synthetic data. In this paper, we propose a novel two-phase framework that fundamentally augments generative ZSL frameworks through the combination of multimodal semantics. Specifically, we design an unsupervised algorithm to effectively fuse two kinds of semantics: human-defined attributes (HA) and commonsense embeddings (CSE) from ConceptNet. To further improve accuracy, we apply a cascade classification network consisting of two prototype classifiers that leverage unseen samples with high confidence. Extensive experiments show that our method outperforms state-of-the-art methods on multiple datasets by significant margins and can serve as a plug-in to augment generative ZSL frameworks.
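To make the two-phase idea concrete, the Python sketch below illustrates one plausible reading of the pipeline: fuse the two semantic modalities into a single class embedding, then classify with a confidence-gated prototype stage. The abstract does not specify the fusion algorithm or the cascade details, so everything here is an illustrative assumption: the l2-normalize-and-concatenate fusion, the cosine-similarity prototypes, the 0.9 confidence threshold, and the function names `fuse_semantics` and `cascade_prototype_predict` are all hypothetical placeholders, not the paper's actual method.

```python
import numpy as np

def fuse_semantics(ha, cse):
    """Fuse human-defined attributes (HA) with commonsense embeddings (CSE).

    The paper's unsupervised fusion algorithm is not given in the abstract;
    as a placeholder, each modality is l2-normalized per class and the two
    are concatenated into one fused semantic vector per class.

    ha:  (num_classes, d_ha) attribute matrix
    cse: (num_classes, d_cse) ConceptNet embedding matrix
    """
    ha = ha / np.linalg.norm(ha, axis=1, keepdims=True)
    cse = cse / np.linalg.norm(cse, axis=1, keepdims=True)
    return np.concatenate([ha, cse], axis=1)

def cascade_prototype_predict(features, prototypes, threshold=0.9):
    """First stage of a hypothetical cascade of prototype classifiers.

    Predictions whose softmax confidence exceeds `threshold` are flagged as
    high-confidence; in a full cascade those samples would be used to refine
    a second prototype classifier (the refinement step is omitted here).

    features:   (num_samples, d) visual features
    prototypes: (num_classes, d) one prototype vector per class
    """
    # Cosine similarity between visual features and class prototypes.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ p.T

    # Softmax over classes to obtain per-sample confidence scores.
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return preds, confident
```

In a generative ZSL pipeline of the kind the abstract describes, the fused semantics would condition a generator that synthesizes unseen-class features, and the high-confidence predictions from the first classifier would then feed the second prototype classifier.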