Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization

Published: 01 Jan 2023, Last Modified: 06 Mar 2025 · PRCV (6) 2023 · CC BY-SA 4.0
Abstract: The audio-visual event localization (AVE) task focuses on localizing audio-visual events, i.e., events whose signals occur in both the audio and visual modalities. Existing approaches primarily emphasize multimodal (audio-visual fused) feature processing to capture high-level event semantics, while overlooking the potential of unimodal (audio-only or visual-only) features for distinguishing unimodal event segments, where the event signal appears in only one modality. To overcome this limitation, we propose the Unimodal-Multimodal Collaborative Enhancement (UMCE) framework for audio-visual event localization. The framework proceeds in several steps. First, the audio and visual features are enhanced by the multimodal features and then adaptively fused to further enhance the multimodal features. Simultaneously, the unimodal features collaborate with the multimodal features to filter out unimodal events. Finally, a dual interaction mechanism exchanges information by jointly emphasizing event content at both the segment and video levels, and the resulting video features are used for event classification. Experimental results demonstrate that UMCE significantly outperforms state-of-the-art methods in both the supervised and weakly supervised AVE settings.
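The adaptive fusion step described above can be illustrated with a minimal sketch: per-segment audio and visual features are combined through a learned gate that decides how much each modality contributes. This is a hypothetical numpy illustration of gated fusion in general, not the paper's exact UMCE formulation; the function and weight names (`adaptive_fuse`, `w_a`, `w_v`) are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fuse(audio, visual, w_a, w_v):
    """Gate-weighted fusion of per-segment audio and visual features.

    audio, visual: (T, D) features for T video segments.
    w_a, w_v: (D,) gate weights (learned in a real model).
    Hypothetical sketch of adaptive fusion -- not the paper's exact method.
    """
    # Per-segment scalar gate in (0, 1): how much the audio stream contributes.
    gate = sigmoid(audio @ w_a + visual @ w_v)[:, None]  # shape (T, 1)
    # Convex combination of the two modalities per segment.
    return gate * audio + (1.0 - gate) * visual          # shape (T, D)

rng = np.random.default_rng(0)
T, D = 10, 16
fused = adaptive_fuse(rng.standard_normal((T, D)),
                      rng.standard_normal((T, D)),
                      rng.standard_normal(D),
                      rng.standard_normal(D))
print(fused.shape)  # (10, 16)
```

In a trained model the gate weights would be parameters optimized with the localization loss, letting segments dominated by one modality down-weight the noisier stream before the fused features re-enhance the multimodal representation.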