Abstract: In this paper, we propose a two-branch multimodal conditional variational auto-encoder (MC-VAE) that learns a unified real-world event embedding space for zero-shot event discovery. Specifically, given multimodal data, a Vision Transformer is exploited to extract global and local visual features, and BERT is adopted to obtain high-level semantic textual features. A textual MC-VAE and a visual MC-VAE are then designed to learn complementary multimodal representations: using textual features as conditions, the textual MC-VAE encodes visual features so that they conform to textual semantics; symmetrically, using visual features as conditions, the visual MC-VAE encodes textual features in accordance with visual semantics. In particular, both branches employ an MSE loss to preserve the visual and textual semantics of the learned complementary representations. Finally, the complementary multimodal representations produced by the two branches are integrated to predict real-world event labels in embedding form, and this prediction in turn provides feedback for fine-tuning the Vision Transformer. Experiments on real-world and zero-shot datasets demonstrate that the proposed MC-VAE outperforms existing methods.
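The two-branch design described in the abstract can be summarized in a minimal sketch. This is not the authors' released code: the module names, layer widths, feature dimensions (e.g., 768-d ViT/BERT features and a 300-d event-label embedding), and the added KL regularizer are assumptions made purely for illustration of a conditional VAE with an MSE reconstruction term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalVAE(nn.Module):
    """One MC-VAE branch: encodes features `x` conditioned on features `c` (hypothetical sketch)."""

    def __init__(self, x_dim, c_dim, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim + c_dim, 512), nn.ReLU())
        self.fc_mu = nn.Linear(512, latent_dim)
        self.fc_logvar = nn.Linear(512, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + c_dim, 512), nn.ReLU(), nn.Linear(512, x_dim)
        )

    def forward(self, x, c):
        h = self.encoder(torch.cat([x, c], dim=-1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        recon = self.decoder(torch.cat([z, c], dim=-1))
        return recon, z, mu, logvar


def branch_loss(recon, target, mu, logvar):
    """MSE reconstruction term (as in the abstract) plus an assumed KL regularizer."""
    mse = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + kld


class MCVAE(nn.Module):
    """Two branches: visual features conditioned on text, textual features conditioned on vision."""

    def __init__(self, vis_dim=768, txt_dim=768, latent_dim=256, event_emb_dim=300):
        super().__init__()
        # Textual MC-VAE: encodes visual features with textual features as the condition.
        self.textual_branch = ConditionalVAE(vis_dim, txt_dim, latent_dim)
        # Visual MC-VAE: encodes textual features with visual features as the condition.
        self.visual_branch = ConditionalVAE(txt_dim, vis_dim, latent_dim)
        # Fuses the two complementary representations into an event-label embedding.
        self.event_head = nn.Linear(2 * latent_dim, event_emb_dim)

    def forward(self, vis_feat, txt_feat):
        recon_v, z_v, mu_v, lv_v = self.textual_branch(vis_feat, txt_feat)
        recon_t, z_t, mu_t, lv_t = self.visual_branch(txt_feat, vis_feat)
        event_emb = self.event_head(torch.cat([z_v, z_t], dim=-1))
        loss = branch_loss(recon_v, vis_feat, mu_v, lv_v) + branch_loss(recon_t, txt_feat, mu_t, lv_t)
        return event_emb, loss


# Usage with placeholder ViT/BERT features (batch of 4 samples):
model = MCVAE()
event_emb, loss = model(torch.randn(4, 768), torch.randn(4, 768))
```

In this sketch, the gradient of `loss` (plus whatever event-prediction objective the paper uses) would flow back through the visual features, which is one way the reported fine-tuning feedback to the Vision Transformer could be realized.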