As We Speak: Real-Time Visually Guided Speaker Separation and Localization

Published: 2022 (MMSP 2022), Last Modified: 12 May 2023
Abstract: Real-time speaker separation and localization is crucial for applications such as video call enhancement, automatic subtitle localization, and spatial voice generation/panning. The common approach to speaker localization and separation is to detect candidate faces and then perform visually guided voice separation for each. Two methods are used for face detection: a face detector on static video frames [1], [2], or audio-visual sequence processing for active speaker detection [3]. In this work, we propose improvements to the visually guided speaker separation model to make it real-time. The described model follows the face-detector approach. It extends real-time models known for speech enhancement [4], [5] by adding face processing to ultimately perform visually guided speaker separation. Our system is lightweight, with 0.6M trainable parameters, and performs speaker separation near-instantaneously, with a delay of a single input audio frame. To our knowledge, it is the first real-time system for visually guided speaker separation. From the application point of view, it is important that the model performs both tasks at the same time: speech separation and active speaker localization.
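The pipeline the abstract describes — detect a candidate face, then condition a causal, frame-by-frame separator on it so the output lags the input by only one audio frame — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: all function names, shapes, the frame size, and the embedding-based gain are assumptions standing in for the real face encoder and separation network.

```python
# Hypothetical sketch of visually guided streaming separation:
# a face crop is embedded once, then each incoming audio frame is
# processed causally with only that frame of delay. The face encoder
# and separator below are toy stand-ins, not the paper's model.
import numpy as np

FRAME = 256  # samples per audio frame (assumed hop size)

def embed_face(face_crop: np.ndarray) -> np.ndarray:
    """Stand-in for a face encoder: reduce a crop to a fixed-size vector."""
    return face_crop.reshape(-1)[:16].astype(np.float32)

def separate_frame(audio_frame, face_emb, state):
    """Stand-in for one causal separator step: one frame in, one frame out.

    A real model would be a small recurrent/convolutional network; here a
    gain derived from the face embedding mimics speaker conditioning."""
    gain = 1.0 / (1.0 + np.exp(-float(face_emb.mean())))  # sigmoid of mean
    out = gain * audio_frame            # "separated" frame (illustrative)
    new_state = audio_frame.copy()      # carry one frame of causal context
    return out, new_state

def stream_separate(audio: np.ndarray, face_crop: np.ndarray) -> np.ndarray:
    """Process audio frame by frame: latency is a single input frame."""
    emb = embed_face(face_crop)
    state = np.zeros(FRAME, dtype=np.float32)
    outs = []
    for i in range(0, len(audio) - FRAME + 1, FRAME):
        out, state = separate_frame(audio[i:i + FRAME], emb, state)
        outs.append(out)
    return np.concatenate(outs)

audio = np.random.default_rng(0).standard_normal(4 * FRAME).astype(np.float32)
face = np.random.default_rng(1).standard_normal((32, 32)).astype(np.float32)
sep = stream_separate(audio, face)
assert sep.shape == audio.shape  # output keeps pace with the input stream
```

With multiple detected faces, the same per-frame step would run once per face embedding, which is how a single model can deliver both separation and active speaker localization at once.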
