Visually-Informed Multichannel Sound Source Separation Based on 3D Gaussian Primitives

Haruaki Asano, Ryunosuke Nihei, Yoshiaki Bando, Aditya Arie Nugraha, Diego Di Carlo, Hiroyuki Ueda, Yosuke Ito, Kazuyoshi Yoshii

Published: 2025, Last Modified: 15 May 2026, APSIPA 2025, CC BY-SA 4.0

Abstract: This paper proposes visually-informed sound source separation for audio-visual understanding of indoor scenes captured by distributed microphone arrays and cameras. Our approach leverages the 3D information of sound-emitting objects, reconstructed via 3D Gaussian splatting (3DGS), to overcome a limitation of modern blind source separation methods like multichannel nonnegative matrix factorization (MNMF). While adaptable and potentially performant, the iterative optimization of MNMF often converges to poor local minima due to the highly expressive full-rank spatial covariance matrices (SCMs) of sources. Our key idea is to treat the set of 3D Gaussians representing a sizable sound source object as a collection of sub-sources that share an audio signal but have unique emission weights, both of which are to be estimated jointly from an observed mixture. To enforce this structure, we guide MNMF by regularizing the SCM of each source object at each frequency. Specifically, we use a prior that centers the SCM estimate around a weighted sum of theoretical SCMs, which are analytically derived from the 3D Gaussian positions. Experiments with simulated data, featuring two 3D human models, demonstrated the effectiveness of the proposed method. To our knowledge, this is the first work to use 3D Gaussians as a common primitive for joint audio-visual analysis.
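
The "weighted sum of theoretical SCMs" mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the propagation model, so everything below is an assumption made for illustration: a free-field (anechoic) spherical-wave model, a rank-1 SCM per Gaussian center, a fixed speed of sound, and the hypothetical names `steering_vectors` and `prior_scm`. It is not the authors' implementation, and the positions and weights are random placeholders rather than data from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the paper


def steering_vectors(points, mics, freq):
    """Assumed free-field steering vectors, shape (n_points, n_mics)."""
    # Distance from every Gaussian center to every microphone.
    dist = np.linalg.norm(points[:, None, :] - mics[None, :, :], axis=-1)
    # Spherical-wave model: 1/distance attenuation and propagation-delay phase.
    return np.exp(-2j * np.pi * freq * dist / SPEED_OF_SOUND) / dist


def prior_scm(points, weights, mics, freq):
    """Weighted sum of rank-1 SCMs a_k a_k^H at one frequency (illustrative)."""
    a = steering_vectors(points, mics, freq)
    return np.einsum("k,km,kn->mn", weights, a, a.conj())


# Placeholder scene: 50 Gaussian centers around (2, 1, 1) m and a 4-mic array.
rng = np.random.default_rng(0)
points = rng.uniform(-0.3, 0.3, size=(50, 3)) + np.array([2.0, 1.0, 1.0])
mics = np.array([[0.00, 0.00, 1.0],
                 [0.05, 0.00, 1.0],
                 [0.00, 0.05, 1.0],
                 [0.05, 0.05, 1.0]])
weights = rng.uniform(size=50)
weights /= weights.sum()  # nonnegative emission weights, normalized to sum to 1

G = prior_scm(points, weights, mics, freq=1000.0)
```

Under these assumptions `G` is a Hermitian positive semidefinite matrix of size (microphones × microphones). In the proposed method, a matrix of this kind would serve as the center of the prior on a source's SCM at one frequency, with the emission weights estimated jointly rather than fixed as they are here.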

External IDs: dblp:conf/apsipa/AsanoNBNCUIY25