Keywords: Spoken language processing, Multi-modal SLU, Encoder fusion
TL;DR: We propose combining pretrained speech and text encoders via cross-attention, and we demonstrate the resulting architecture in multiple spoken language processing systems.
Abstract: Spoken language processing tasks that extract information from the speech signal can benefit from both the speech and text modalities. In this paper, we propose combining pretrained speech and text encoders via cross-attention, and we demonstrate the proposed architecture in multiple spoken language processing systems. Our results indicate that it is more efficient to re-purpose previously trained, independent modality encoders and learn only the cross-attention from scratch. The resulting architecture captures both acoustic and lexical information, and it performs text tagging while attending to the speech encoder for improved results. We use compact pretrained speech and text encoders that are resource-efficient and can be trained on a single consumer GPU.
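As a rough illustration of the fusion described in the abstract, the following is a minimal PyTorch sketch, not the authors' implementation: all module names, dimensions, head counts, and the tag count are hypothetical assumptions. It shows text-token states (queries) attending over projected speech frames (keys/values) via cross-attention, with only the fusion block and tagging head learned from scratch while both pretrained encoders stay frozen.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical fusion of frozen pretrained text and speech encoders.

    Only this module (cross-attention + tagging head) is trained from
    scratch, mirroring the efficiency claim in the abstract.
    """
    def __init__(self, text_dim=768, speech_dim=512, num_heads=8, num_tags=10):
        super().__init__()
        # Project speech features into the text dimension so text tokens
        # (queries) can attend over speech frames (keys/values).
        self.speech_proj = nn.Linear(speech_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)
        self.tag_head = nn.Linear(text_dim, num_tags)

    def forward(self, text_states, speech_states):
        # text_states:   (batch, text_len, text_dim), from a frozen text encoder
        # speech_states: (batch, speech_len, speech_dim), from a frozen speech encoder
        speech_kv = self.speech_proj(speech_states)
        attended, _ = self.cross_attn(query=text_states, key=speech_kv, value=speech_kv)
        fused = self.norm(text_states + attended)  # residual connection over text states
        return self.tag_head(fused)                # per-token tag logits

# Toy usage with random tensors standing in for frozen encoder outputs.
fusion = CrossAttentionFusion()
text = torch.randn(2, 16, 768)     # e.g., BERT-style token states (assumed dims)
speech = torch.randn(2, 120, 512)  # e.g., wav2vec-style frame states (assumed dims)
logits = fusion(text, speech)      # shape: (2, 16, num_tags)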
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)