Abstract: Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances and to identify their event categories, using only video-level category labels for training. Most previous efforts have been devoted to refining the supervision for each modality or extracting richer cross-modal information for more reliable feature learning; none of them address the imbalanced feature learning across modalities that is inherent to this task. In this paper, to balance the feature learning processes of the different modalities, a dynamic gradient modulation (DGM) mechanism is explored, in which a metric function is designed to measure the degree of imbalance between audio and visual feature learning. Furthermore, by analyzing conventional WS-AVVP pipelines in depth, two additional challenges are identified: entangled multimodal computation hampers precise measurement of the audio-visual learning imbalance, and the global supervision provided by video-level labels cannot offer explicit guidance for robust semantic feature learning in each action subspace. To cope with these issues, a modality-separated decision unit (MSDU) and a semantic-aware feature extractor (SAFE) are designed for precise measurement of imbalanced feature learning and for unambiguous semantic-aware feature extraction, respectively. Comprehensive experiments on public benchmarks demonstrate the effectiveness of the proposed method.
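To make the idea of gradient modulation concrete, the following is a minimal, illustrative sketch of how per-modality gradients could be rescaled according to an imbalance measure during training. It assumes a PyTorch-style loop with separate audio and visual encoders and per-modality losses; the ratio-based metric, the coefficient rule, and all names (`modulation_coefficients`, `training_step`, `alpha`) are placeholders for illustration, not the exact DGM formulation proposed in the paper.

```python
import torch

def modulation_coefficients(loss_audio: torch.Tensor,
                            loss_visual: torch.Tensor,
                            alpha: float = 1.0):
    """Return gradient scaling factors (k_audio, k_visual) from an imbalance ratio.

    The modality with the larger loss is treated as under-optimized and keeps
    full-strength gradients; the dominant modality is attenuated.
    (Hypothetical metric, for illustration only.)
    """
    ratio = (loss_audio / (loss_visual + 1e-8)).detach()
    if ratio > 1.0:  # audio lags behind -> damp visual gradients
        k_v = float(torch.clamp(1.0 / (1.0 + alpha * (ratio - 1.0)), 0.0, 1.0))
        return 1.0, k_v
    else:            # visual lags behind -> damp audio gradients
        inv = 1.0 / (ratio + 1e-8)
        k_a = float(torch.clamp(1.0 / (1.0 + alpha * (inv - 1.0)), 0.0, 1.0))
        return k_a, 1.0

def training_step(audio_enc, visual_enc, head, batch, optimizer):
    """One training step with dynamic per-modality gradient rescaling."""
    feat_a = audio_enc(batch["audio"])
    feat_v = visual_enc(batch["video"])
    # `head` is assumed to return separate audio and visual losses computed
    # against the video-level labels.
    loss_a, loss_v = head(feat_a, feat_v, batch["video_label"])
    (loss_a + loss_v).backward()

    k_a, k_v = modulation_coefficients(loss_a, loss_v)
    for p in audio_enc.parameters():
        if p.grad is not None:
            p.grad.mul_(k_a)
    for p in visual_enc.parameters():
        if p.grad is not None:
            p.grad.mul_(k_v)

    optimizer.step()
    optimizer.zero_grad()
```

The coefficients are recomputed from the current losses at every step, so the modulation adapts dynamically as one modality catches up with the other; the paper's actual metric function and the modality-separated measurement (MSDU) would replace the simple loss ratio used here.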