A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos

05 Nov 2022 · OpenReview Archive Direct Upload · Readers: Everyone
Abstract: Humans can easily recognize where and how a sound is produced by watching a scene and listening to the corresponding audio cues. To achieve such cross-modal perception on machines, existing methods localize the sound source using only maps generated by interpolation operations. As semantic object-level localization is more attractive for practical applications, we argue that these existing map-based approaches provide only a coarse-grained and indirect description of the sound source. In this paper, we advocate a novel proposal-based paradigm that directly performs semantic object-level localization, without any manual annotations. We incorporate the global response map as an unsupervised spatial constraint to weight the proposals according to how well they cover the estimated global shape of the sound source. As a result, our proposal-based sound source localization can be cast as a simpler Multiple Instance Learning (MIL) problem by filtering out instances corresponding to large sound-unrelated regions. Our method achieves state-of-the-art (SOTA) performance compared to several baselines on multiple datasets.
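To make the proposal-weighting idea concrete, below is a minimal sketch (not the authors' code) of scoring region proposals by how well they cover a global response map; the function name, the thresholding rule, and the coverage-based weighting are all illustrative assumptions.

```python
# Minimal sketch: weight box proposals by how much of the estimated "global shape"
# of the sound source (a thresholded response map) they cover. All names and the
# exact weighting rule are assumptions for illustration, not the paper's method.
import numpy as np

def proposal_weights(response_map: np.ndarray, proposals: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """
    response_map: (H, W) array in [0, 1], e.g. an audio-visual similarity map
                  upsampled to image resolution.
    proposals:    (N, 4) array of boxes (x1, y1, x2, y2) in pixel coordinates.
    Returns an (N,) array of weights: the fraction of the thresholded response
    mass that each proposal covers.
    """
    shape_mask = response_map >= thresh          # estimated global shape of the source
    total = shape_mask.sum() + 1e-8              # avoid division by zero
    weights = np.zeros(len(proposals), dtype=np.float32)
    for i, (x1, y1, x2, y2) in enumerate(proposals.astype(int)):
        weights[i] = shape_mask[y1:y2, x1:x2].sum() / total
    return weights

if __name__ == "__main__":
    # Toy example: proposals covering little of the response mass get low weights
    # and could be filtered out before multiple-instance learning over the rest.
    resp = np.zeros((224, 224)); resp[60:160, 80:180] = 1.0
    boxes = np.array([[70, 50, 190, 170], [0, 0, 40, 40], [100, 100, 150, 150]])
    print(proposal_weights(resp, boxes))  # approx. [1.0, 0.0, 0.25]
```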