In Pursuit of Babel - Multilingual End-to-End Spoken Language Understanding

Published: 01 Jan 2021, Last Modified: 21 May 2025
Venue: ASRU 2021
License: CC BY-SA 4.0
Abstract: End-to-end spoken language understanding (E2E SLU) systems predict utterance semantics directly from speech. To the best of our knowledge, E2E models have so far been trained to recognize semantics for only a single language. In this work we introduce the first multilingual E2E SLU system and present results across three languages: English, Spanish and French. We propose a transformer-based, multilingual acoustic encoder for intent prediction that leverages pre-training for both the acoustic and linguistic modalities of the SLU model. It learns a robust, cross-modal latent space using a pre-trained multilingual BERT as a semantic teacher. The best-performing model achieves relative improvements of 7.2% in a single-language setting, 5-6% in two-language settings, and 4-6% in three-language settings. An intent-wise analysis shows that semantic supervision becomes more important for shorter utterances, while providing an explicit language identifier at the input leads to lower intent classification errors.