OpenAVE: Moving towards Open Set Audio-Visual Event Localization

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Oral, CC BY 4.0
Abstract: Audio-Visual Event (AVE) Localization aims to identify and classify video segments that are both audible and visible, a field that has seen substantial progress in recent years. Existing methods operate under a closed-set assumption and struggle to recognize unknown events in open-world scenarios. To better adapt to real-life applications, we introduce the Open Set Audio-Visual Event Localization task and propose a novel and effective network called OpenAVE, based on evidential deep learning. To the best of our knowledge, this is the first effort to address this challenge. Our approach comprises deep evidential AVE classification and event-relevant prediction, targeting the demands of open-set environments. The deep evidential AVE classification manages event classification uncertainty by extracting class evidence from segment-specific representations enriched with multi-scale context. To effectively distinguish between unknown events and background segments, event-relevant prediction utilizes positive-unlabeled learning. Furthermore, a learnable Gaussian-prior prediction branch is adopted to enhance the performance of event-relevant prediction. Experimental results demonstrate that OpenAVE significantly outperforms state-of-the-art models on the Audio-Visual Event dataset, confirming the effectiveness of our proposed method.
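The abstract says class evidence is used to manage classification uncertainty. As a rough illustration only, the sketch below follows the commonly used evidential deep learning formulation, in which non-negative per-class evidence parameterizes a Dirichlet distribution and low total evidence yields high uncertainty. It is not taken from the OpenAVE paper: the function names `dirichlet_opinion` and `is_unknown`, the example evidence values, and the rejection threshold of 0.5 are all assumptions made for this example.

```python
from typing import List, Tuple


def dirichlet_opinion(evidence: List[float]) -> Tuple[List[float], float]:
    """Map non-negative per-class evidence to expected class
    probabilities and a scalar uncertainty mass in (0, 1].

    Assumed (standard EDL) formulation, not the paper's exact one:
        alpha_k = e_k + 1,  S = sum(alpha),  p_k = alpha_k / S,  u = K / S
    """
    if not evidence:
        raise ValueError("evidence must contain at least one class")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]   # Dirichlet concentration parameters
    strength = sum(alpha)                 # Dirichlet strength S
    probs = [a / strength for a in alpha]  # expected class probabilities
    uncertainty = num_classes / strength   # equals 1.0 when there is no evidence
    return probs, uncertainty


def is_unknown(evidence: List[float], threshold: float = 0.5) -> bool:
    """Flag a segment as an unknown event when uncertainty exceeds a
    hypothetical threshold (0.5 is an arbitrary illustrative value)."""
    _, uncertainty = dirichlet_opinion(evidence)
    return uncertainty > threshold


if __name__ == "__main__":
    # Strong evidence for class 0: low uncertainty, treated as known.
    print(dirichlet_opinion([36.0, 0.0, 0.0, 0.0]))
    # No evidence for any class: maximal uncertainty, treated as unknown.
    print(is_unknown([0.0, 0.0, 0.0, 0.0]))
```

In this sketch a segment with little evidence for every known class receives uncertainty close to 1 and would be rejected as unknown, whereas a segment with strong evidence for one class is assigned to that class. Separating unknown events from background segments, which the abstract attributes to event-relevant prediction with positive-unlabeled learning, is not covered here.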
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: Video, image, and audio are all core elements of multimedia data. Combining video with its corresponding audio helps us comprehend our surroundings and the events taking place in them. To meet real-world needs, a novel audio-visual event localization task is proposed for the open-set scenario, which requires not only identifying known audio-visual event classes and accurately locating the boundaries of event instances, but also rejecting unknown events in the given videos.
Supplementary Material: zip
Submission Number: 323