Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

Anonymous

17 Sept 2021 (modified: 05 May 2023) · ACL ARR 2021 September Blind Submission
Abstract: Tandem spoken language understanding (SLU) systems suffer from the so-called automatic speech recognition (ASR) error propagation problem. Additionally, because the ASR module is optimized to extract only the linguistic content rather than semantics, relevant semantic cues may be missing from its transcripts. In this work, we propose a multimodal language understanding (MLU) architecture to mitigate these problems. Our solution is based on two compact unidirectional long short-term memory (LSTM) models that encode speech and text information, together with a fusion layer that combines the audio and text embeddings. Two fusion strategies are explored: a simple concatenation of these embeddings and a cross-modal attention mechanism that learns the contribution of each modality. The first approach proved to be the better solution for robustly extracting semantic information from audio-textual data; we found that attention is less effective at test time when the text modality is corrupted. Our model is evaluated on three SLU datasets, and robustness is tested using ASR outputs from three off-the-shelf ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem on all datasets.
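A minimal sketch of the concatenation-fusion variant described in the abstract: two compact unidirectional LSTM encoders, one over acoustic features and one over transcript tokens, whose final hidden states are concatenated and passed to a classifier. All module names, feature dimensions, and the choice of classification head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultimodalLU(nn.Module):
    """Sketch of an audio-textual SLU model with concatenation fusion.
    Dimensions and layer sizes below are assumed for illustration."""

    def __init__(self, audio_dim=80, vocab_size=10000, text_emb_dim=128,
                 hidden_dim=256, num_intents=10):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, text_emb_dim)
        # Two compact unidirectional LSTM encoders, one per modality
        self.audio_lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.text_lstm = nn.LSTM(text_emb_dim, hidden_dim, batch_first=True)
        # Fusion by simple concatenation of the two utterance embeddings
        self.classifier = nn.Linear(2 * hidden_dim, num_intents)

    def forward(self, audio_feats, token_ids):
        # audio_feats: (batch, frames, audio_dim), e.g. log-Mel features
        # token_ids:   (batch, seq_len) ASR or reference transcript tokens
        _, (h_audio, _) = self.audio_lstm(audio_feats)
        _, (h_text, _) = self.text_lstm(self.text_emb(token_ids))
        fused = torch.cat([h_audio[-1], h_text[-1]], dim=-1)
        return self.classifier(fused)
```

The cross-modal attention variant mentioned in the abstract would replace the concatenation step with learned modality weights; it is omitted here since the abstract reports that plain concatenation was the more robust fusion strategy under corrupted text input.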