CM-PIE: Cross-Modal Perception for Interactive-Enhanced Audio-Visual Video Parsing

Published: 01 Jan 2024, Last Modified: 16 May 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Audio-visual video parsing is the task of categorizing a video at the segment level using only weak, video-level labels, predicting the events in each segment as audible or visible. Recent methods have leveraged the attention mechanism to capture semantic correlations across the whole video and across the audio and visual modalities. However, these approaches may overlook the importance of individual segments and their interrelations within a video, and typically rely on a single modality when learning features. In this paper, we propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which learns fine-grained features by applying a segment-based attention module. In addition, a cross-modal aggregation block is introduced to jointly optimize the semantic representation of audio and visual signals by enhancing inter-modal interactions. Experimental results show that our model offers improved parsing performance on the Look, Listen, and Parse (LLP) dataset compared to other methods.
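The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names: per-modality attention over segments (fine-grained features) and a cross-modal aggregation step in which each modality attends to the other. The module names, feature dimension, number of segments, and use of PyTorch's `nn.MultiheadAttention` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SegmentAttention(nn.Module):
    """Hypothetical sketch: self-attention over a video's segments within one modality,
    refining each segment feature using its relations to the other segments."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_segments, dim)
        refined, _ = self.attn(x, x, x)
        return self.norm(x + refined)


class CrossModalAggregation(nn.Module):
    """Hypothetical sketch: cross-attention in both directions so audio segment
    features are enriched by visual cues and vice versa."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.audio_from_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_from_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Each modality queries the segments of the other modality.
        a_enh, _ = self.audio_from_visual(audio, visual, visual)
        v_enh, _ = self.visual_from_audio(visual, audio, audio)
        return self.norm_a(audio + a_enh), self.norm_v(visual + v_enh)


if __name__ == "__main__":
    B, T, D = 2, 10, 512                 # batch, segments, feature dim (assumed values)
    audio = torch.randn(B, T, D)          # placeholder audio segment features
    visual = torch.randn(B, T, D)         # placeholder visual segment features

    seg_attn_audio, seg_attn_visual = SegmentAttention(D), SegmentAttention(D)
    cross = CrossModalAggregation(D)

    audio_out, visual_out = cross(seg_attn_audio(audio), seg_attn_visual(visual))
    print(audio_out.shape, visual_out.shape)  # (2, 10, 512) each, ready for event classification heads
```

In this sketch the refined, cross-modally aggregated segment features would then feed segment-level event classifiers trained with the weak video-level labels, but that final prediction and training stage is not detailed in the abstract and is omitted here.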