Cross-modal Contrastive Learning for Speech Translation

Anonymous

17 Dec 2021 (modified: 05 May 2023) · ACL ARR 2021 December Blind Submission · Readers: Everyone
Abstract: How can we learn similar representations for spoken utterances and their written text? We believe a unified, aligned representation of speech and text will lead to improvements in speech translation. To this end, we propose ConST, a cross-modal contrastive learning method for end-to-end speech-to-text translation. We evaluate ConST and a variety of previous baselines on multiple language directions (En-De/Fr/Ru) of the popular MuST-C benchmark. Experiments show that the proposed ConST consistently outperforms all previous methods and achieves a state-of-the-art average BLEU of 28.5. Further analysis verifies that ConST indeed closes the representation gap between modalities: its learned representation improves the accuracy of cross-modal text retrieval from 4% to 88%.
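To illustrate the kind of objective the abstract describes, here is a minimal InfoNCE-style sketch in PyTorch for pulling paired speech and text representations together. It is an assumption-laden illustration, not the paper's exact formulation: the function name, the use of mean-pooled embeddings, the temperature value, and the symmetric loss are all hypothetical choices.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style cross-modal contrastive loss (illustrative sketch).

    speech_emb, text_emb: (batch, dim) pooled representations of paired
    utterances and transcripts. Row i of each tensor is assumed to be a
    positive pair; all other rows in the batch serve as negatives.
    """
    speech = F.normalize(speech_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    # (batch, batch) matrix of cosine similarities, scaled by temperature.
    logits = speech @ text.t() / temperature
    targets = torch.arange(speech.size(0), device=speech.device)
    # Symmetric: each utterance should retrieve its own transcript, and vice versa.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Usage with dummy embeddings (in practice these would come from
# speech and text encoders trained jointly with the translation loss):
loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```

Minimizing such a loss alongside the translation objective is one standard way to close the modality gap the abstract refers to, since it directly rewards an utterance and its transcript for mapping to nearby points in the shared embedding space.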
Paper Type: long
Consent To Share Data: yes