Audio-Visual Clustering for 3D Speaker Localization

Vasil Khalidov, Florence Forbes, Miles E. Hansard, Elise Arnaud, Radu Horaud

Published: 2008, Last Modified: 12 May 2023 | MLMI 2008 | Readers: Everyone

Abstract: We address the issue of localizing individuals in a scene that contains several people engaged in a multiple-speaker conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space, via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation Maximization algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker.
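
As a rough illustration of the clustering idea in the abstract, the sketch below runs standard Expectation-Maximization on a single Gaussian mixture over synthetic 3D points. It is not the authors' model: the paper couples audio and visual observations through a pair of mixtures and also estimates speaker activity, none of which is reproduced here. The function names (`init_means`, `log_gauss`, `em_gmm`), the farthest-point initialization, the regularization term `reg`, and the synthetic speaker positions are all assumptions made for this example.

```python
import numpy as np

def init_means(X, K):
    # Farthest-point initialization (an assumption of this sketch):
    # deterministic, and it spreads the initial means across the data.
    means = [X[0]]
    for _ in range(1, K):
        d = np.min([np.sum((X - m) ** 2, axis=1) for m in means], axis=0)
        means.append(X[np.argmax(d)])
    return np.array(means)

def log_gauss(X, mean, cov):
    # Log-density of N(mean, cov) evaluated at each row of X.
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum((diff @ np.linalg.inv(cov)) * diff, axis=1)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def em_gmm(X, K, n_iter=50, reg=1e-6):
    n, d = X.shape
    means = init_means(X, K)
    covs = np.array([np.cov(X.T) + reg * np.eye(d)] * K)
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) of each cluster
        # for each point, computed in log space for numerical stability.
        log_r = np.stack(
            [np.log(weights[k]) + log_gauss(X, means[k], covs[k]) for k in range(K)],
            axis=1,
        )
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(d)
    return weights, means, covs, resp

# Synthetic stand-in for observations already mapped to 3D scene space:
# two hypothetical speakers, 100 noisy points each.
rng = np.random.default_rng(0)
true_means = np.array([[-1.0, 0.0, 2.0], [1.0, 0.0, 2.5]])
X = np.vstack([m + 0.1 * rng.standard_normal((100, 3)) for m in true_means])

weights, means, covs, resp = em_gmm(X, K=2)
labels = resp.argmax(axis=1)  # hard cluster assignment per observation
```

In this toy setting each estimated mean plays the role of one speaker's 3D position, and `resp` gives the soft assignment of each observation to a speaker. Modeling two modalities with different observation spaces, as the abstract describes, would require a separate mixture per modality tied to shared 3D parameters.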
