Semi-Supervised Speech-Language Joint Pre-Training for Spoken Language Understanding

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Withdrawn Submission
Keywords: joint pre-training, multimodal representation learning, spoken language understanding, speech representation learning
Abstract: Spoken language understanding (SLU) requires a model to analyze input acoustic signals to understand their linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to utilize large-scale unlabeled text and speech data. However, the inherent disparities between the two modalities call for analyzing speech and text jointly rather than in isolation. In this paper, we propose a novel semi-supervised learning method, AlignNet, to jointly pre-train the speech and language modules. Besides self-supervised masked language modeling within each individual module, AlignNet aligns representations from paired speech and transcripts in a shared latent semantic space. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge. Experimental results verify the effectiveness of our approach on various SLU tasks. For example, AlignNet improves the previous state-of-the-art accuracy on the Spoken SQuAD dataset by 6.2%.
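
As a rough illustration of the objective described in the abstract, the sketch below (a hypothetical PyTorch-style implementation, not the authors' code) combines per-module masked-prediction losses with an alignment term that pulls paired speech and transcript representations together in a shared latent space. The module names, sizes, mean-pooling, and the MSE alignment term are all assumptions for illustration; the actual AlignNet formulation is given in the paper.

```python
# Minimal sketch of a joint speech-text pre-training objective (assumed, not AlignNet's exact losses).
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointPretrainingSketch(nn.Module):
    def __init__(self, speech_dim=80, vocab_size=30000, hidden=768):
        super().__init__()
        # Separate encoders for each modality (hypothetical architectures/sizes).
        self.speech_proj = nn.Linear(speech_dim, hidden)
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True), num_layers=4)
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True), num_layers=4)
        # Heads for the per-module self-supervised objectives.
        self.speech_head = nn.Linear(hidden, speech_dim)  # reconstruct masked frames
        self.text_head = nn.Linear(hidden, vocab_size)    # predict masked tokens

    def forward(self, speech, masked_speech, tokens, masked_tokens):
        # speech:        (B, T_s, speech_dim) clean acoustic features
        # masked_speech: (B, T_s, speech_dim) features with some frames masked
        # tokens:        (B, T_t) original transcript token ids
        # masked_tokens: (B, T_t) token ids with some positions masked
        h_speech = self.speech_encoder(self.speech_proj(masked_speech))
        h_text = self.text_encoder(self.text_embed(masked_tokens))

        # 1) Self-supervised losses on each individual module (computed over all
        #    positions here for brevity; in practice only masked positions count).
        loss_speech = F.mse_loss(self.speech_head(h_speech), speech)
        loss_text = F.cross_entropy(self.text_head(h_text).transpose(1, 2), tokens)

        # 2) Alignment loss on paired speech/transcripts: pull mean-pooled
        #    utterance and transcript representations together.
        loss_align = F.mse_loss(h_speech.mean(dim=1), h_text.mean(dim=1))

        return loss_speech + loss_text + loss_align
```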
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: We propose a joint pre-training framework for speech and text to learn speech representations that capture both the acoustic information and the contextual semantic knowledge of the input speech signal.
Reviewed Version (pdf): https://openreview.net/references/pdf?id=t7zQQPc7E1
