Module-Based End-to-End Distant Speech Processing: A case study of far-field automatic speech recognition [Special Issue On Model-Based and Data-Driven Audio Signal Processing]

Published: 01 Jan 2024 · Last Modified: 14 May 2025 · IEEE Signal Process. Mag. 2024 · CC BY-SA 4.0
Abstract: Distant speech processing is a critical downstream application in speech and audio signal processing. Traditionally, researchers have addressed this challenge by breaking it down into distinct subproblems encompassing the extraction of clean speech signals from noisy inputs, feature extraction, and transcription. This approach led to the development of modular distant automatic speech recognition (DASR) models, which are often designed as multiple stages in cascade, each corresponding to a specific subproblem. Recently, the surge in the capabilities of deep learning has propelled the popularity of purely end-to-end (E2E) models that employ a single large neural network to tackle an entire DASR task in an extremely data-driven manner. However, an alternative paradigm persists in the form of a modular model design, which can often leverage speech and signal processing models. Although this approach mirrors the multistage model, it is trained through an E2E process. This article overviews the recent development of DASR systems, focusing on E2E module-based models and showcasing successful downstream applications of model-based and data-driven audio signal processing.