Speech-MLP: a simple MLP architecture for speech processing

Published: 28 Jan 2022, Last Modified: 13 Feb 2023. ICLR 2022 Submitted.
Keywords: MLP, transformers, speech signal processing
Abstract: Overparameterized transformer-based architectures have shown remarkable performance in recent years, achieving state-of-the-art results in speech processing tasks such as speech recognition, speech synthesis, keyword spotting, and speech enhancement. The main assumption is that, with the underlying self-attention mechanism, transformers can ultimately capture long-range temporal dependencies in speech signals. In this paper, we propose a multi-layer perceptron (MLP) architecture, namely speech-MLP, for extracting information from speech signals. The model splits feature channels into non-overlapping chunks and processes each chunk individually. The processed chunks are then merged and further processed to consolidate the output. By varying the number of chunks and focusing on different contextual window sizes, speech-MLP learns multiscale local temporal dependencies. The proposed model is evaluated on two tasks: keyword spotting and speech enhancement. In our experiments, we use two benchmark datasets for keyword spotting (Google Speech Commands V2-35 and LibriWords) and the VoiceBank dataset for speech enhancement. In all experiments, speech-MLP surpassed transformer-based solutions, achieving state-of-the-art performance with fewer parameters and simpler training schemes. These results indicate that more complex models such as transformers are often unnecessary for speech processing tasks and should not be treated as the default choice, since simpler and more compact models can offer competitive or superior performance.
One-sentence Summary: A pure MLP architecture with a novel feature-partitioning method, trained with simple schemes, outperforms elaborately designed transformer-based models in speech signal processing tasks.
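
The split-process-merge mechanism described in the abstract can be sketched roughly as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the class and parameter names (SplitMergeBlock, contexts) and the choice of depthwise 1-D convolutions as the per-chunk temporal mixers are assumptions made purely for illustration; the paper's actual layers may differ.

```python
import torch
import torch.nn as nn


class SplitMergeBlock(nn.Module):
    """Illustrative sketch of the split-process-merge idea from the
    abstract. Layer choices (e.g. depthwise Conv1d as the per-chunk
    temporal mixer) are assumptions, not the authors' exact design."""

    def __init__(self, channels: int, contexts: list[int]):
        super().__init__()
        num_chunks = len(contexts)
        assert channels % num_chunks == 0
        # Odd window sizes keep the time dimension unchanged after padding.
        assert all(k % 2 == 1 for k in contexts)
        chunk_dim = channels // num_chunks
        # One temporal mixer per chunk; each chunk focuses on a different
        # contextual window size, giving the multiscale behaviour.
        self.mixers = nn.ModuleList(
            nn.Conv1d(chunk_dim, chunk_dim, kernel_size=k,
                      padding=k // 2, groups=chunk_dim)
            for k in contexts
        )
        # Linear projection that consolidates the merged chunks.
        self.merge = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels); split channels into disjoint chunks.
        chunks = x.chunk(len(self.mixers), dim=-1)
        # Process each chunk individually (Conv1d expects channels first).
        mixed = [m(c.transpose(1, 2)).transpose(1, 2)
                 for m, c in zip(self.mixers, chunks)]
        # Merge the processed chunks and consolidate the output.
        return self.merge(torch.cat(mixed, dim=-1))
```

For example, SplitMergeBlock(channels=256, contexts=[3, 7, 15, 31]) splits 256 feature channels into four 64-dimensional chunks, each mixed over a different temporal window, mirroring the multiscale local dependency modelling described in the abstract; applying it to an input of shape (batch, time, 256) returns a tensor of the same shape.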