Object-Aware Image Augmentation for Audio-Visual Zero-Shot Learning

Yujie Dong, Shiming Chen, Bowen Duan, Weiping Ding, Yisong Wang, Xinge You

Published: 2025, Last Modified: 20 Mar 2026, IEEE Trans. Emerg. Top. Comput. Intell. 2025. License: CC BY-SA 4.0
Abstract: Audio-visual zero-shot learning (ZSL) leverages both video and audio information for model training, aiming to classify video categories that were not seen during training. However, existing methods often fail to learn robust multi-modal feature representations because they overlook the importance of object-aware images. Moreover, these methods require complete modal information to operate effectively, which limits their performance in resource-constrained environments. To address these issues, this paper proposes an Object-Aware Image Augmentation Network (OAIA) for audio-visual ZSL. OAIA introduces a Cross-Modal Feature Augmentation (CMFA) subnet and a Missing Modality Generation (MMG) subnet to enhance feature representations and generate virtual features for missing modalities. Specifically, the CMFA subnet uses attention mechanisms to integrate and enhance the features of video, audio, and object-aware images, providing a richer training signal and promoting the learning of more diverse and discriminative multi-modal representations. The MMG subnet employs a multi-layer perceptron to generate virtual features for missing modalities from the available modal information, ensuring that the model can operate effectively even when modal data is incomplete at test time. Extensive experiments demonstrate the effectiveness and superiority of the proposed method.
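To make the two subnets concrete, the following is a minimal sketch of the ideas the abstract describes: attention-based fusion over video, audio, and object-aware image features (CMFA-style), and a small MLP that synthesizes a virtual feature for a missing modality from the available ones (MMG-style). All names, dimensions, and weights here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # per-modality feature dimension (hypothetical)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(query, modalities):
    """Scaled dot-product attention: the query attends over stacked
    modality features and returns their weighted combination."""
    scores = modalities @ query / np.sqrt(d)   # (n_modalities,)
    weights = softmax(scores)                  # sums to 1
    return weights @ modalities                # fused feature, shape (d,)

# Hypothetical per-modality embeddings (video, audio, object-aware image).
video = rng.standard_normal(d)
audio = rng.standard_normal(d)
obj_img = rng.standard_normal(d)

stacked = np.stack([video, audio, obj_img])    # (3, d)
fused = attention_fuse(video, stacked)         # CMFA-style fusion (sketch)

# MMG-style generator: a two-layer MLP mapping the concatenated available
# modalities to a virtual feature for the missing one. Weights are random
# placeholders; in practice they would be learned.
W1 = rng.standard_normal((2 * d, 2 * d)) * 0.1
W2 = rng.standard_normal((2 * d, d)) * 0.1

def generate_missing(available):
    h = np.maximum(np.concatenate(available) @ W1, 0.0)  # ReLU hidden layer
    return h @ W2                                        # virtual feature, (d,)

virtual_audio = generate_missing([video, obj_img])
print(fused.shape, virtual_audio.shape)
```

At test time, the generated `virtual_audio` would stand in for the real audio feature wherever the fusion step expects it, which is what lets the model run with incomplete modal data.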