Cross-modal Contrastive Learning for Speech Translation

Anonymous

08 Mar 2022 (modified: 05 May 2023) NAACL 2022 Conference Blind Submission
Readers: Everyone
Paper Link: https://openreview.net/forum?id=zmxg59rhm3D
Paper Type: Long paper (up to eight pages of content + unlimited references and appendices)
Abstract: How can we learn unified representations for spoken utterances and their written text? Learning similar representations for semantically similar speech and text is important for speech translation. To this end, we propose ConST, a cross-modal contrastive learning method for end-to-end speech-to-text translation. We evaluate ConST and a variety of previous baselines on the popular MuST-C benchmark. Experiments show that ConST consistently outperforms previous methods and achieves an average BLEU of 29.4. Further analysis verifies that ConST indeed closes the representation gap between modalities: its learned representations improve the accuracy of cross-modal speech-text retrieval from 4% to 88%. Code and models are available at https://github.com/ReneeYe/ConST.
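
As a rough illustration of the cross-modal contrastive idea described in the abstract, here is a minimal PyTorch sketch of a symmetric InfoNCE-style loss between pooled speech and text embeddings. This is not the authors' implementation (see the linked repository for that); the function name cross_modal_contrastive_loss, the tensor names speech_emb and text_emb, and the temperature value are illustrative assumptions.

    # Minimal, hypothetical sketch of a cross-modal contrastive (InfoNCE-style)
    # loss between pooled speech and text representations. Names are illustrative,
    # not taken from the ConST codebase.
    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(speech_emb: torch.Tensor,
                                     text_emb: torch.Tensor,
                                     temperature: float = 0.05) -> torch.Tensor:
        """speech_emb, text_emb: (batch, dim) pooled utterance/sentence vectors.
        Paired rows are positives; all other rows in the batch are negatives."""
        speech_emb = F.normalize(speech_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # Cosine similarity between every speech/text pair in the batch.
        logits = speech_emb @ text_emb.t() / temperature      # (batch, batch)
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: speech-to-text and text-to-speech directions.
        loss_s2t = F.cross_entropy(logits, targets)
        loss_t2s = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_s2t + loss_t2s)

    if __name__ == "__main__":
        # Random embeddings standing in for speech/text encoder outputs.
        speech = torch.randn(8, 512)
        text = torch.randn(8, 512)
        print(cross_modal_contrastive_loss(speech, text).item())

Paired speech/text rows in a batch act as positives and all other rows as in-batch negatives, which is the standard way such a contrastive objective pulls matched utterances and transcripts together in a shared representation space.
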
Presentation Mode: This paper will be presented in person in Seattle
Virtual Presentation Timezone: UTC+8
Copyright Consent Signature (type Name Or NA If Not Transferrable): Rong Ye
Copyright Consent Name And Address: Rong Ye, ByteDance AI Lab, Shanghai Business Park, Building 24, Zone B, 1999 Yishan Road, Minhang District, Shanghai, China.