CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation

Xiaohu Zhao; Haoran Sun; Yikun Lei; Shaolin Zhu; Deyi Xiong

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Findings

Submission Type: Regular Long Paper

Submission Track: Machine Translation

Submission Track 2: Speech and Multimodality

Keywords: speech translation, representation disentanglement

Abstract: Deep neural networks have demonstrated their capacity to extract features from speech inputs. However, these features may include non-linguistic speech factors, such as timbre and speaker identity, that are not directly related to translation. In this paper, we propose CCSRD, a content-centric speech representation disentanglement learning framework for speech translation, which decomposes speech representations into content representations and non-linguistic representations via representation disentanglement learning. CCSRD consists of a content encoder that encodes linguistic content information from the speech input, a non-content encoder that models non-linguistic speech features, and a disentanglement module that learns disentangled representations with a cyclic reconstructor, a feature reconstructor, and a speaker classifier, all trained jointly via multi-task learning. Experiments on the MuST-C benchmark dataset demonstrate that CCSRD achieves an average improvement of +0.9 BLEU over the baseline in two settings across five translation directions, outperforming state-of-the-art end-to-end speech translation models and cascaded models.

Submission Number: 3160
