Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

Anonymous

16 Feb 2022 (modified: 05 May 2023) · ACL ARR 2022 February Blind Submission
Abstract: Tandem spoken language understanding (SLU) systems suffer from so-called automatic speech recognition (ASR) error propagation. In this work, we investigate how this problem affects state-of-the-art NLU models such as BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa. Moreover, we propose a multimodal language understanding (MLU) system to mitigate the SLU performance degradation caused by errors in ASR transcripts. Our solution combines an encoder network that embeds audio signals with the state-of-the-art BERT model that processes text transcripts, and a fusion layer that merges the audio and text embeddings. Two fusion strategies are explored: averaging the probabilities predicted by each modality, and a similar scheme with an additional fine-tuning step. The first approach proved to be the better choice for extracting semantic information when the text input was severely corrupted, whereas the second was slightly better when the quality of the ASR transcripts was higher. We found that as the quality of ASR transcripts degraded, the performance of BERT and RoBERTa also degraded, compromising overall SLU performance, whereas the proposed MLU was more robust to poor-quality ASR transcripts. Our model is evaluated on five tasks from three SLU datasets with different complexity levels, and its robustness is tested using outputs from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem across all datasets.
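
The two fusion strategies can be illustrated with a minimal late-fusion sketch in PyTorch. This is an assumption for illustration only: the class names, embedding dimensions, and the learnable fusion weight below are not taken from the paper, which only states that per-modality probabilities are averaged, optionally with a fine-tuning step.

# Minimal sketch of the two late-fusion strategies described in the abstract.
# All module names and dimensions are illustrative assumptions, not the
# authors' actual implementation.
import torch
import torch.nn as nn


class LateFusionSLU(nn.Module):
    """Fuse class probabilities from an audio branch and a text (BERT) branch."""

    def __init__(self, num_classes: int, audio_dim: int = 128, text_dim: int = 768):
        super().__init__()
        # Stand-ins for the audio-encoder head and the BERT classification head.
        self.audio_head = nn.Linear(audio_dim, num_classes)
        self.text_head = nn.Linear(text_dim, num_classes)
        # Learnable fusion weight, used only by the fine-tuned variant (assumed form).
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, audio_emb, text_emb, fine_tuned: bool = False):
        p_audio = torch.softmax(self.audio_head(audio_emb), dim=-1)
        p_text = torch.softmax(self.text_head(text_emb), dim=-1)
        if fine_tuned:
            # Weighted average with a coefficient learned during fine-tuning.
            w = torch.sigmoid(self.alpha)
            return w * p_audio + (1.0 - w) * p_text
        # Plain average of the per-modality probabilities.
        return 0.5 * (p_audio + p_text)


# Usage: intent probabilities for a batch of two utterances.
model = LateFusionSLU(num_classes=10)
audio_emb = torch.randn(2, 128)   # pooled audio-encoder embeddings
text_emb = torch.randn(2, 768)    # pooled BERT [CLS] embeddings
probs = model(audio_emb, text_emb, fine_tuned=False)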
Paper Type: long
