Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim

Published: 01 Jan 2024, Last Modified: 25 Jun 2025, Venue: CVPR 2024, License: CC BY-SA 4.0

Abstract: The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise an object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object while also distinguishing between different objects and backgrounds. This enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show significant performance improvements of the proposed method over existing methods for both single- and multi-source localization. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL.
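
To illustrate the general idea of localizing an unknown number of sources iteratively, a minimal sketch follows. It is not the IOI module or the OSC loss described in the abstract: it is a simple greedy stand-in that assumes an audio-visual similarity map is already available as a 2D grid of scores. The function name `iterative_regions` and the parameters `stop_thresh`, `grow_ratio`, and `max_iters` are hypothetical and chosen only for this example.

```python
from collections import deque


def iterative_regions(sim_map, stop_thresh=0.5, grow_ratio=0.6, max_iters=10):
    """Greedy illustration only (hypothetical, not the paper's IOI module).

    Repeatedly picks the strongest remaining response, grows a connected
    region around it, removes that region, and stops when no response
    exceeds stop_thresh. The number of regions is discovered, not given.
    """
    h, w = len(sim_map), len(sim_map[0])
    residual = [row[:] for row in sim_map]  # working copy; found regions get zeroed
    regions = []
    for _ in range(max_iters):
        # Locate the strongest remaining response.
        peak, (py, px) = max(
            (residual[y][x], (y, x)) for y in range(h) for x in range(w)
        )
        if peak < stop_thresh:
            break  # nothing salient is left, so stop without a preset count
        # Grow a 4-connected region of cells whose score is close to the peak.
        cutoff = grow_ratio * peak
        region, queue = {(py, px)}, deque([(py, px)])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < h and 0 <= nx < w
                if inside and (ny, nx) not in region and residual[ny][nx] >= cutoff:
                    region.add((ny, nx))
                    queue.append((ny, nx))
        regions.append(region)
        # Suppress the found region so the next iteration looks elsewhere.
        for y, x in region:
            residual[y][x] = 0.0
    return regions


# Toy similarity map with two separated high-response areas (made-up values).
toy_map = [
    [0.9, 0.8, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.6],
]
print(len(iterative_regions(toy_map)))  # 2
```

The stopping rule here is a fixed threshold, which is an assumption of this sketch; the abstract does not state how the proposed method decides when to stop iterating.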