CMAST: Efficient Speech-Text Joint Training Method to Enhance Linguistic Features Learning of Speech Representations
Abstract: Self-supervised pre-training on the speech modality has exhibited promising performance across diverse domains. However, self-supervised speech representations often fail to sufficiently capture linguistic content, limiting their effectiveness in tasks that rely heavily on linguistic information, such as intent classification and automatic speech recognition. To address this limitation, we introduce an efficient speech-text joint training method, referred to as CMAST, which leverages cross-modal alignment between speech and text pre-trained models. With a limited amount of paired speech and text data, the pre-trained model's ability to capture linguistic content is significantly enhanced. We conduct extensive evaluations on the SUPERB benchmark to verify the effectiveness of our approach. The results show that our CMAST model outperforms previous speech pre-trained models on a series of downstream tasks. Compared with state-of-the-art speech-text joint pre-trained models, CMAST achieves comparable performance while requiring fewer parameters and less training data.