Coordinate and Generalize: A Unified Framework for Audio-Visual Zero-Shot Learning

Published: 01 Feb 2023, Last Modified: 13 Feb 2023 · ICLR 2023 Conference Withdrawn Submission
Abstract: Audio-Visual Zero-Shot Learning (AV-ZSL) aims to train a model that classifies videos of unseen classes by leveraging audio and visual data, transferring knowledge obtained from seen classes. We identify two imperative issues that need to be addressed: (1) \emph{How to effectively exploit both the audio and the visual information?} and (2) \emph{How to transfer knowledge from seen classes to unseen classes?} In this paper, we address both issues in a unified framework by enhancing two ingredients that existing methods seldom consider. (1) \emph{Multi-Modal Coordination}: Unlike existing methods, which simply fuse the audio and visual features with an attention mechanism, we additionally perform knowledge distillation between the visual and audio branches. This allows information to flow between the two branches and encourages them to learn from each other. (2) \emph{Generalization Capacity}: Existing methods only consider the alignment between the audio-visual features and the semantic features on the seen classes, ignoring generalization capacity. Inspired by interpretability methods for Deep Neural Networks (DNNs), we propose a novel gradient-based approach that generates transferable masks for the visual and audio features, forcing the model to focus on the most discriminative segments and facilitating knowledge transfer from seen to unseen classes. Extensive experiments on three challenging benchmarks, ActivityNet-GZSL, UCF-GZSL, and VGGSound-GZSL, demonstrate that our proposed approach significantly outperforms the state-of-the-art methods.
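To make the \emph{Multi-Modal Coordination} idea concrete, here is a minimal sketch of bidirectional knowledge distillation between an audio branch and a visual branch. The symmetric-KL formulation, the temperature value, and all names are illustrative assumptions; the submission's actual distillation objective may differ.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_loss(audio_logits, visual_logits, temperature=4.0):
    """Symmetric KL divergence between the softened predictions of the two
    branches, so each modality learns from the other (hypothetical sketch)."""
    t = temperature
    log_p_audio = F.log_softmax(audio_logits / t, dim=-1)
    log_p_visual = F.log_softmax(visual_logits / t, dim=-1)
    # Visual branch acts as teacher for the audio branch (teacher detached).
    loss_audio = F.kl_div(log_p_audio, log_p_visual.detach(),
                          reduction="batchmean", log_target=True)
    # Audio branch acts as teacher for the visual branch.
    loss_visual = F.kl_div(log_p_visual, log_p_audio.detach(),
                           reduction="batchmean", log_target=True)
    # Scale by t^2, as is standard in distillation (Hinton et al., 2015).
    return 0.5 * (loss_audio + loss_visual) * t ** 2

# Usage: logits from the two branches over the seen classes.
audio_logits = torch.randn(8, 42)   # (batch, num_seen_classes)
visual_logits = torch.randn(8, 42)
loss_kd = mutual_distillation_loss(audio_logits, visual_logits)
```

Detaching the teacher distribution in each direction keeps the two losses from collapsing into a single symmetric term, so each branch is pulled toward the other's (frozen) predictions in turn.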
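For the \emph{Generalization Capacity} component, the following sketch shows one way a gradient-based, Grad-CAM-style mask over temporal segments could be computed. The feature shape, the gradient-magnitude importance, and the min-max normalization are assumptions for illustration, not the method proposed here.

```python
import torch

def gradient_mask(features, score, eps=1e-8):
    """features: (batch, segments, dim) with requires_grad=True.
    score: scalar, e.g. the compatibility between pooled features and a
    class embedding. Returns a (batch, segments) mask in [0, 1] that
    highlights the segments most responsible for the score."""
    grads, = torch.autograd.grad(score, features, retain_graph=True)
    importance = grads.abs().mean(dim=-1)       # pool gradient magnitude over channels
    lo = importance.amin(dim=1, keepdim=True)
    hi = importance.amax(dim=1, keepdim=True)
    return (importance - lo) / (hi - lo + eps)  # per-sample min-max normalization

# Usage: re-weight segment features by the mask.
feats = torch.randn(4, 16, 512, requires_grad=True)  # 4 clips, 16 temporal segments
score = feats.mean()                                  # stand-in for a real compatibility score
mask = gradient_mask(feats, score)                    # (4, 16)
masked_feats = feats * mask.unsqueeze(-1)             # emphasize discriminative segments
```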
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)