Abstract: Target Speaker Extraction (TSE) in real-world multi-speaker environments is highly challenging. Previous works have largely relied on pre-enrollment speech to extract the target speaker's voice. However, such methods are limited in spontaneous scenarios where pre-enrollment speech or spatial information is unavailable. To address this, we propose Semantic and Spatial Dual Querying (SSDQ), a unified framework that integrates natural language descriptions and region-based spatial queries to guide TSE. SSDQ employs dual query encoders for semantic and spatial cues, fusing them into the audio stream via a FiLM-based interaction module. A novel Controllable Feature Wrapping (CFW) mechanism further enables a dynamic balance between speaker identity and acoustic clarity. We also introduce SS-Libri, a spatialized mixture dataset designed to benchmark dual-query systems. Extensive experiments demonstrate that SSDQ achieves superior extraction accuracy and robustness under challenging conditions, yielding an SI-SNRi of 19.63 dB, an SNRi of 20.30 dB, a PESQ of 1.83, and an STOI of 0.26.
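To illustrate the FiLM-based interaction mentioned above, the sketch below shows generic feature-wise linear modulation: a query embedding is projected to per-channel scale (gamma) and shift (beta) terms that condition the audio feature map. All names, shapes, and projection matrices here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(audio_feats, query_emb, W_gamma, W_beta):
    """FiLM-style conditioning: y = gamma * x + beta.

    audio_feats: (channels, frames) audio feature map
    query_emb:   (d,) fused semantic/spatial query embedding (hypothetical)
    W_gamma, W_beta: (channels, d) learned projections (hypothetical)
    """
    gamma = W_gamma @ query_emb            # per-channel scale, shape (channels,)
    beta = W_beta @ query_emb              # per-channel shift, shape (channels,)
    # Broadcast scale/shift across the time (frames) axis.
    return gamma[:, None] * audio_feats + beta[:, None]

C, T, D = 4, 10, 8
x = rng.standard_normal((C, T))   # stand-in audio features
q = rng.standard_normal(D)        # stand-in dual-query embedding
Wg = rng.standard_normal((C, D))
Wb = rng.standard_normal((C, D))
y = film(x, q, Wg, Wb)
assert y.shape == (C, T)
```

In practice the modulation would be applied inside a neural separator and the projections learned jointly with the query encoders; this fragment only demonstrates the conditioning arithmetic.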
External IDs: doi:10.1109/lsp.2025.3591408