Abstract: This paper introduces a speech separation technique based on clustering time-frequency (T-F) bin patches or raw speech blocks. Our approach combines traditional graph-based clustering objectives with deep neural networks, enabling effective and scalable speech separation. We extract features from T-F bin patches or raw speech blocks using a pre-trained encoder, then apply deep modularization to cluster them, which lets us identify the clusters dominated by individual speakers in a mixed speech signal. Extensive evaluations on multiple datasets, including WSJ0-2mix and WHAM!, show that our method is competitive with fully supervised state-of-the-art speech separation models. In particular, our approach separates complex acoustic mixtures without requiring parallel datasets and mitigates permutation ambiguity, making it well suited to real-world multi-speaker environments.
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Tatiana_Likhomanenko1
Submission Number: 3481