Navigating Audio-Visual Event Detection Across Mismatched Modalities

Published: 01 Jan 2022, Last Modified: 19 Feb 2024, ICASSP 2022, Readers: Everyone
Abstract: Previous work on audio-visual (AV) alignment mainly focuses on frame-level synchronization and neglects clip-wise matching. We focus on AV parsing of fully unconstrained data, where audio and visual events do not necessarily co-occur. We provide a video-enhanced Audioset dataset covering 376 events to investigate parsing in this mismatched setting. To our knowledge, this is the first time that AV event parsing and detection have been examined in a clip-wise matching scenario. Experiments show that our proposed method substantially improves video parsing accuracy on both tagging and detection. Furthermore, a parsing model pre-trained on our dataset can help accurately locate the time spans in which audio and video are synchronized.