A Progressive Placeholder Learning Network for Multimodal Zero-Shot Learning

Published: 01 Jan 2024 · Last Modified: 20 May 2025 · IEEE Trans. Multim. 2024 · CC BY-SA 4.0
Abstract: It is challenging to eliminate the domain shift between seen and unseen classes in multimodal zero-shot learning tasks because the data distributions of the seen and unseen domains inherently differ. In this paper, we propose a progressive placeholder learning network with mixup hallucination and an alternating mixer, denoted MHAM, which maintains embedding spaces for the unseen classes. By applying mixup hallucination (MH) to the visual and textual features extracted by a vision transformer and BERT, respectively, MHAM generates visual and textual hallucinated representations whose pseudo class embeddings serve as placeholders for the unseen classes. Furthermore, a stack of alternating mixer (AM) blocks produces modality-shared representations for the seen classes and hallucinated representations of progressive placeholders for the unseen classes. In particular, the mixer in each AM block obtains modality-shared representations by transposing the dimensions of the modality-specific and raw representations to model intermodal interactions. MHAM also employs a freezing strategy that fixes the weights over the unseen classes in the last fully connected layer, which acts as a projection from the raw and modality-shared representations to the embedding space of the seen and unseen classes. Experiments conducted on zero-shot datasets and news event datasets demonstrate the superior performance of the proposed MHAM method.
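The abstract names three components (MH, AM blocks, and the frozen last layer) without implementation details. The PyTorch sketch below illustrates one plausible reading of each; all class names, shapes, and hyperparameters (MixupHallucination, AlternatingMixerBlock, FrozenPlaceholderClassifier, the Beta parameter alpha, and the MLP-Mixer-style transposition) are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class MixupHallucination(nn.Module):
    # Assumed sketch of MH: hallucinated placeholder features as convex
    # combinations of pairs of seen-class features; alpha is a hypothetical
    # Beta-distribution parameter, not taken from the paper.
    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim) visual or textual features from a ViT / BERT.
        lam = self.beta.sample((feats.size(0), 1)).to(feats.device)
        perm = torch.randperm(feats.size(0), device=feats.device)
        # Each mixed feature acts as a pseudo embedding (placeholder)
        # standing in for an unseen class.
        return lam * feats + (1.0 - lam) * feats[perm]

class AlternatingMixerBlock(nn.Module):
    # Assumed reading of one AM block: intermodal interactions are modeled
    # by transposing the (token, channel) axes so that one MLP mixes across
    # the stacked modality-specific/raw representations (MLP-Mixer style).
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, num_tokens), nn.GELU(),
            nn.Linear(num_tokens, num_tokens))
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); tokens stack the visual and textual
        # representations, so mixing across tokens is intermodal mixing.
        y = self.norm1(x).transpose(1, 2)          # (batch, dim, tokens)
        x = x + self.token_mlp(y).transpose(1, 2)  # cross-token (intermodal)
        x = x + self.channel_mlp(self.norm2(x))    # per-token channel mixing
        return x

class FrozenPlaceholderClassifier(nn.Module):
    # Freezing strategy: unseen-class weights of the last fully connected
    # layer are registered as a buffer and receive no gradient updates,
    # while seen-class weights stay trainable.
    def __init__(self, dim: int, num_seen: int, num_unseen: int):
        super().__init__()
        self.seen_w = nn.Parameter(torch.randn(num_seen, dim) * 0.02)
        self.register_buffer("unseen_w", torch.randn(num_unseen, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Projects representations into the joint seen+unseen embedding space.
        w = torch.cat([self.seen_w, self.unseen_w], dim=0)  # (S + U, dim)
        return x @ w.t()                                    # class logits
```

A minimal end-to-end usage under the same assumptions: mix seen-class features into placeholders, pass both through an AM block, and score against the partially frozen classifier.

```python
feats = torch.randn(8, 512)                          # seen-class features
hallucinated = MixupHallucination()(feats)           # unseen placeholders
tokens = torch.stack([feats, hallucinated], dim=1)   # (8, 2, 512)
shared = AlternatingMixerBlock(num_tokens=2, dim=512)(tokens).mean(dim=1)
logits = FrozenPlaceholderClassifier(512, num_seen=10, num_unseen=5)(shared)
print(logits.shape)                                  # torch.Size([8, 15])
```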