SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction

Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

Interspeech 2021 (modified: 24 Apr 2023)

Abstract: End-to-end Speech-to-text Translation (E2E-ST), which directly translates source-language speech into target-language text, is widely useful in practice, whereas traditional cascaded approaches (ASR+MT) often suffer from error propagation in the pipeline. However, existing end-to-end solutions depend heavily on source-language transcriptions for pre-training or for multi-task training with Automatic Speech Recognition (ASR). We instead propose a simple technique that learns a robust speech encoder in a self-supervised fashion on the speech side only, so it can utilize speech data without transcriptions. This technique, termed Spectrogram Reconstruction (SpecRec), learns better speech representations by recovering missing speech frames and provides an alternative solution for improving E2E-ST. We conduct experiments over 8 different translation directions. In the setting without any transcriptions, our technique achieves an average improvement of +1.1 BLEU. SpecRec also improves translation accuracy by +0.7 BLEU over the baseline in the setting where speech translation is trained with ASR multi-task training.
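
The core idea of recovering missing speech frames can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mask_frames`, `reconstruction_loss`), the masking ratio, the span length, the zero-fill masking, and the L1 loss are all assumptions chosen for illustration, since the abstract does not specify them. In a real system the prediction would come from the speech encoder plus a reconstruction head; here it is simply passed in as an array.

```python
import numpy as np

def mask_frames(spec, mask_ratio=0.15, span=4, rng=None):
    """Zero out random spans of frames in a (frames, mel_bins) spectrogram.

    mask_ratio and span are illustrative assumptions, not values from the paper.
    Returns the masked copy and a boolean mask over frames (True = masked).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spec.shape[0]
    mask = np.zeros(n_frames, dtype=bool)
    # Number of spans needed to cover roughly mask_ratio of the frames.
    n_spans = max(1, int(n_frames * mask_ratio / span))
    starts = rng.choice(max(1, n_frames - span + 1), size=n_spans, replace=False)
    for s in starts:
        mask[s:s + span] = True
    masked = spec.copy()
    masked[mask] = 0.0
    return masked, mask

def reconstruction_loss(pred, target, mask):
    """Mean absolute error over masked frames only (L1 is an assumption)."""
    return float(np.abs(pred[mask] - target[mask]).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.standard_normal((100, 80))      # stand-in for a log-mel spectrogram
    masked, mask = mask_frames(spec, rng=rng)
    # A hypothetical encoder would map `masked` to a prediction of `spec`;
    # using `masked` itself as the prediction gives the untrained-baseline loss.
    print("masked frames:", int(mask.sum()))
    print("loss with no reconstruction:", reconstruction_loss(masked, spec, mask))
```

Because the loss is computed only on the masked frames, the objective needs no transcriptions, which is what lets it use untranscribed speech.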
