Multimodal Disentangled Fusion Network via VAEs for Multimodal Zero-Shot Learning

Published: 01 Jan 2025. Last Modified: 05 Nov 2025. IEEE Trans. Comput. Soc. Syst., 2025. License: CC BY-SA 4.0.
Abstract: Addressing the bias problem in multimodal zero-shot learning is challenging because of the domain shift between seen and unseen classes and the semantic gap across modalities. To tackle these challenges, we propose a multimodal disentangled fusion network (MDFN) that unifies the class embedding space for multimodal zero-shot learning. MDFN employs a feature disentangled variational autoencoder (FD-VAE) in each of two branches to disentangle unimodal features into semantically consistent and semantically unrelated modality-specific representations, where semantics are shared within classes. In particular, the semantically consistent representations are integrated with the unimodal features in residual form to retain the semantics of the original features. Furthermore, a multimodal conditional VAE (MC-VAE) in each branch learns cross-modal interactions under modality-specific conditions. Finally, the complementary multimodal representations produced by the MC-VAE are encoded by a fusion network (FN) trained with a self-adaptive margin center loss (SAMC-loss) to predict target class labels in embedding form. By learning distances among domain samples, the SAMC-loss promotes intraclass compactness and interclass separability. Experiments on zero-shot and news event datasets demonstrate the superior performance of MDFN, improving the harmonic mean by 27.2% on the MMED dataset and 5.1% on the SUN dataset.
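The abstract states that the SAMC-loss encourages intraclass compactness and interclass separability by learning distances among domain samples. The PyTorch sketch below illustrates one plausible reading of such a self-adaptive margin center loss; the class name `SelfAdaptiveMarginCenterLoss`, the margin-adaptation rule, and all hyperparameters are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAdaptiveMarginCenterLoss(nn.Module):
    """Hypothetical sketch of a self-adaptive margin center loss (SAMC-loss).

    Pulls each embedding toward its own class center (compactness) and pushes
    it away from the nearest other-class center by a margin that adapts to the
    current spread of positive distances (separability). The adaptation rule
    is an assumption, not the paper's definition.
    """

    def __init__(self, num_classes: int, embed_dim: int, base_margin: float = 0.5):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.base_margin = base_margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Distance from each sample to every class center: (B, C).
        dists = torch.cdist(embeddings, self.centers)

        # Intraclass term: distance to the ground-truth class center.
        pos = dists.gather(1, labels.unsqueeze(1)).squeeze(1)

        # Interclass term: distance to the closest wrong-class center.
        mask = F.one_hot(labels, self.centers.size(0)).bool()
        neg = dists.masked_fill(mask, float("inf")).min(dim=1).values

        # Self-adaptive margin: grows with the batch-wise spread of the
        # positive distances (one plausible choice, assumed here).
        margin = self.base_margin + pos.detach().std()

        # Hinge: small pos (compactness) and neg > pos + margin (separability).
        return (pos + F.relu(pos - neg + margin)).mean()


if __name__ == "__main__":
    # Usage sketch: embeddings from a fusion network, integer class labels.
    loss_fn = SelfAdaptiveMarginCenterLoss(num_classes=10, embed_dim=64)
    z = torch.randn(32, 64)
    y = torch.randint(0, 10, (32,))
    print(loss_fn(z, y).item())
```

In this sketch the class centers are trainable parameters updated jointly with the fusion network, which is one common way to realize a center-style loss; the paper may compute or update centers differently.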