Machine learning-based efficient audio production separation method

Published: 2025 · Last Modified: 25 Jan 2026 · Int. J. Mach. Learn. Cybern. 2025 · License: CC BY-SA 4.0
Abstract: Audio production separation, the extraction of individual sound sources from a mixture signal, has numerous applications in audio remixing, teleconferencing, and hearing aids. However, most existing methods exploit only spectral information and neglect the spatial cues available in multi-microphone setups, which limits their performance. This paper proposes a novel audio production separation algorithm that combines hyperdirectional beamforming with a long short-term memory (LSTM) network to exploit both spatial and spectral information for efficient multi-speaker audio production separation. The hyperdirectional beamformer enhances target audio signals arriving from desired directions while suppressing interference. The enhanced signals are then processed by an LSTM network that predicts time-frequency masks for separating the individual sources, trained with a multi-task learning objective. Extensive experiments on simulated and real-world datasets demonstrate that the proposed algorithm outperforms benchmark algorithms on objective metrics across a range of acoustic conditions. Subjective listening tests with human participants further confirm its improved perceptual quality and intelligibility. An ablation study highlights the importance of both the hyperdirectional beamforming and LSTM components, as well as their synergistic effect. The proposed algorithm offers a practical approach to exploiting spatial and spectral information in multi-speaker audio production separation, with potential applications in teleconferencing, hearing aids, and audio signal processing.
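The two-stage pipeline the abstract describes (spatial enhancement via beamforming, followed by time-frequency masking) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the linear array geometry, the simple delay-and-sum beamformer standing in for the paper's hyperdirectional beamformer, and the fixed placeholder mask standing in for the LSTM's predicted mask are all assumptions.

```python
import numpy as np

def steering_vector(freqs, mic_positions, doa_deg, c=343.0):
    # Far-field steering vector for a linear array (hypothetical geometry).
    # freqs: (F,) Hz; mic_positions: (M,) metres along the array axis.
    doa = np.deg2rad(doa_deg)
    delays = mic_positions * np.cos(doa) / c                  # per-mic delay (s)
    return np.exp(-2j * np.pi * np.outer(freqs, delays))      # (F, M)

def beamform(stft_mix, freqs, mic_positions, doa_deg):
    # Delay-and-sum beamforming in the STFT domain; stft_mix: (M, F, T).
    # (A stand-in for the paper's hyperdirectional beamformer.)
    a = steering_vector(freqs, mic_positions, doa_deg)        # (F, M)
    w = a / a.shape[1]                                        # normalized weights
    return np.einsum('fm,mft->ft', np.conj(w), stft_mix)      # (F, T)

def apply_mask(enhanced, mask):
    # Element-wise time-frequency masking; in the paper the mask would be
    # predicted by the LSTM network rather than supplied directly.
    return enhanced * mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mics = np.array([0.0, 0.05, 0.10, 0.15])                  # 4-mic linear array
    freqs = np.linspace(200.0, 4000.0, 64)
    # Synthetic source STFT and its multi-channel mixture from direction 0 deg.
    src = rng.standard_normal((64, 32)) + 1j * rng.standard_normal((64, 32))
    stft_mix = np.einsum('fm,ft->mft', steering_vector(freqs, mics, 0.0), src)
    on_target = beamform(stft_mix, freqs, mics, 0.0)
    off_target = beamform(stft_mix, freqs, mics, 90.0)
    # Steering toward the true direction recovers the source coherently;
    # steering elsewhere partially cancels it.
    print(np.mean(np.abs(on_target) ** 2) > np.mean(np.abs(off_target) ** 2))
```

Steering toward the source's true direction aligns the per-microphone phases so the channels sum coherently, while other directions suffer partial cancellation; the LSTM's predicted mask would then suppress whatever interference leaks through the beam.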