Abstract: Spoken Language Understanding (SLU) is an essential component of voice and speech assistant tools. End-to-End (E2E) SLU models attempt to extract semantic meaning directly from the speech signal, without an intermediate transcription step. However, SLU remains a challenging task, mainly due to the lack of labeled, in-domain, and multilingual datasets. The Spoken Task-Oriented Semantic Parsing (STOP) dataset aims to address this problem and is the most extensive public dataset for the SLU task. This paper describes our contribution to the Spoken Language Understanding Grand Challenge at ICASSP 2023. The core idea of the proposed model is to use a pre-trained HuBERT model as the encoder, paired with a transformer decoder trained with LayerDrop and ensemble learning. The combination of a HuBERT large encoder and a base transformer decoder achieved the best results, with an Exact Match (EM) accuracy of 75.05% on the STOP dataset. Ensemble decoding further improved the accuracy to 75.92%.
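The sketch below illustrates the general encoder-decoder setup described above, not the authors' exact implementation: a pre-trained HuBERT encoder feeding a small transformer decoder with LayerDrop applied to the decoder layers. The checkpoint name, vocabulary size, and layer counts are assumptions for illustration only.

```python
# Minimal sketch (assumptions, not the paper's code): HuBERT encoder + transformer
# decoder with LayerDrop. Checkpoint name and hyperparameters are illustrative.
import torch
import torch.nn as nn
from transformers import HubertModel


class LayerDropDecoder(nn.Module):
    """Stack of transformer decoder layers with LayerDrop during training."""

    def __init__(self, d_model, nhead, num_layers, p_drop):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.p_drop = p_drop

    def forward(self, tgt, memory, tgt_mask=None):
        out = tgt
        for layer in self.layers:
            # LayerDrop: randomly skip whole decoder layers while training.
            if self.training and torch.rand(1).item() < self.p_drop:
                continue
            out = layer(out, memory, tgt_mask=tgt_mask)
        return out


class HubertSeq2Seq(nn.Module):
    def __init__(self, vocab_size=1000, d_model=1024, num_layers=6, p_drop=0.1):
        super().__init__()
        # Pre-trained HuBERT large acoustic encoder (hidden size 1024);
        # "facebook/hubert-large-ll60k" is an assumed checkpoint.
        self.encoder = HubertModel.from_pretrained("facebook/hubert-large-ll60k")
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = LayerDropDecoder(d_model, nhead=8, num_layers=num_layers, p_drop=p_drop)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech, target_tokens):
        # speech: (batch, samples) raw waveform; target_tokens: (batch, seq) parse-token ids.
        memory = self.encoder(speech).last_hidden_state          # (batch, frames, 1024)
        tgt = self.embed(target_tokens)                          # (batch, seq, 1024)
        mask = nn.Transformer.generate_square_subsequent_mask(
            target_tokens.size(1)
        ).to(speech.device)
        dec = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(dec)                                     # logits over semantic-parse tokens
```

Ensemble decoding, in this reading, would combine the output distributions (e.g., averaged logits) of several such models trained with different seeds or configurations before selecting the final parse.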