Abstract: This paper presents a method for detecting off-screen sounds based on a loss function of self-supervised audio-visual spatialization. Audio source separation is a long-standing problem, and audio-visual source separation has recently attracted attention in the fields of acoustic signal processing, image processing, and machine learning. Although the visual information in a video provides important cues for audio source separation, it carries no information about sounds whose sources do not appear in the video. The proposed method detects such off-screen sounds by focusing on the correspondences between sounds and their sources in the video. To find these correspondences, we use audio-visual spatialization, which converts mono audio into spatial audio with visual guidance. Experimental results show that the proposed method achieves high accuracy on the task of off-screen sound detection.