ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We apply diffusion posterior sampling to perform unsupervised, array-agnostic, and generative multi-channel blind speech separation.
Abstract: Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem: the microphone array geometry, the room impulse responses (RIRs), and the speech sources are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS, where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to this optimization approximates the room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield the separated voice sources. We only need a simple single-speaker speech diffusion model as a prior, along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos and code are available at: https://arraydps.github.io/ArrayDPSDemo/ and https://github.com/ArrayDPS/ArrayDPS.
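The abstract describes the sampling loop only at a high level, so the sketch below illustrates one plausible reading of it rather than the authors' exact algorithm (see the linked repository for that). The names `denoiser`, `mix`, `fit_filters`, the FIR filter length, and the Euler-style update are all illustrative assumptions. The structure follows the abstract: at each reverse-diffusion step, the single-speaker prior denoises the current source estimates, a small least-squares problem fits per-source filters that stand in for the unknown room acoustics and relative transfer functions, and the resulting measurement residual supplies the DPS-style likelihood gradient.

```python
# Minimal sketch of the ArrayDPS idea (hypothetical names, not the authors' code).
# Assumptions: `denoiser(x, sigma)` is a pretrained single-speaker speech
# diffusion model returning a Tweedie (denoised) estimate; `y` is a
# [n_mics, n_samples] mixture; room acoustics are crudely modeled as short
# per-source, per-mic FIR filters fitted by least squares at every step.
import torch
import torch.nn.functional as F

def mix(sources, h):
    """Render the multichannel mixture: y_hat[m] = sum_s h[s, m] * sources[s]
    (causal FIR convolution per source/microphone pair)."""
    n_src, T = sources.shape
    L = h.shape[2]
    chans = []
    for m in range(h.shape[1]):
        acc = 0.0
        for s in range(n_src):
            kern = h[s, m].flip(0).view(1, 1, -1)  # flip: conv1d is correlation
            sig = sources[s].view(1, 1, -1)
            acc = acc + F.conv1d(sig, kern, padding=L - 1)[0, 0, :T]
        chans.append(acc)
    return torch.stack(chans)

def fit_filters(sources, y, filt_len=16, iters=50, lr=1e-2):
    """Approximate the intractable likelihood: fit FIR filters so the
    denoised source estimates re-render the observed mixtures."""
    h = torch.zeros(sources.shape[0], y.shape[0], filt_len, requires_grad=True)
    opt = torch.optim.Adam([h], lr=lr)
    src = sources.detach()  # only the filters are optimized here
    for _ in range(iters):
        opt.zero_grad()
        loss = ((mix(src, h) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return h.detach()

def arraydps_sample(y, denoiser, sigmas, n_src=2, zeta=1.0):
    """DPS-style reverse diffusion: a prior step from the speech diffusion
    model plus a data-consistency gradient through the fitted filters."""
    x = torch.randn(n_src, y.shape[1]) * sigmas[0]
    for i in range(len(sigmas) - 1):
        x = x.detach().requires_grad_(True)
        x0 = denoiser(x, sigmas[i])            # denoised source estimates
        h = fit_filters(x0, y)                 # approximate room acoustics
        resid = ((mix(x0, h) - y) ** 2).sum()  # measurement residual
        grad = torch.autograd.grad(resid, x)[0]
        with torch.no_grad():
            d = (x - x0) / sigmas[i]           # probability-flow direction
            x = x + (sigmas[i + 1] - sigmas[i]) * d - zeta * grad
    return x.detach()
```

Under these assumptions, a call would look like `arraydps_sample(y, denoiser, sigmas, n_src=2)` with `sigmas` a decreasing noise schedule (e.g., log-spaced from 10 to 1e-3). The sketch ignores details the paper must handle, such as the reference-channel convention for relative transfer functions and scale/permutation ambiguities of the separated sources.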
Lay Summary: When several voices are mixed together in recordings from multiple microphones, how can we separate the individual sources? Intuitively, if we know what a single speaker's speech sounds like, how would that help with separation? And is it possible to design an algorithm that works for any microphone array, without any extra model training? To answer these questions, we use a diffusion model that captures the patterns of single-speaker speech. With this prior, we design a novel posterior sampling algorithm for multi-microphone source separation that steers the separated outputs to follow the single-speaker speech patterns modeled by the diffusion model. Our results show that, without any supervision, our method achieves strong source separation using only a speech diffusion prior. The method readily generalizes to any microphone array and is generative.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/ArrayDPS/ArrayDPS
Primary Area: Applications->Language, Speech and Dialog
Keywords: Diffusion posterior sampling, speech separation, microphone array processing
Submission Number: 3282