UniX-Encoder: A Universal X-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing

Zili Huang, Yiwen Shao, Shi-Xiong Zhang, Dong Yu

Published: 2024, Last Modified: 05 Nov 2025, Venue: ICASSP 2024, License: CC BY-SA 4.0

Abstract: The speech field is evolving to address more challenging scenarios, such as multi-channel recordings with multiple simultaneous talkers. In response to the diversity of microphone configurations in use, we introduce the UniX-Encoder, a universal encoder for multi-channel speech recordings. The UniX-Encoder is versatile, catering to a variety of speech tasks, and integrates seamlessly with any microphone array, in both single-talker and multi-talker environments. Our research improves on previous multi-channel speech processing efforts in four respects: 1) Adaptability: Unlike traditional models, which are constrained to specific microphone array configurations, our encoder is universally compatible. 2) Multi-Task Capability: Unlike previous systems, which were designed for single-task applications, the UniX-Encoder serves as a versatile upstream model, capable of extracting features for diverse speech tasks. 3) Self-Supervised Training: The UniX-Encoder is pretrained without the need for labeled multi-channel data. 4) End-to-End Integration: In contrast to models that first beamform and then process the resulting single-channel signal, our encoder offers an end-to-end solution, bypassing explicit beamforming or separation. To validate its effectiveness, we tested the UniX-Encoder on a synthetic multi-channel dataset derived from the LibriSpeech corpus. Across various tasks, including automatic speech recognition (ASR) and speaker diarization, our encoder consistently outperformed combinations such as the WavLM model with the BeamformIt frontend.

External IDs: dblp:conf/icassp/HuangSZ024