ConSLU: Constrained Decoding for Enhanced Spoken Language Understanding in Joint End-to-End Models

Published: 01 Jan 2024, Last Modified: 05 Aug 2025 · KSE 2024 · CC BY-SA 4.0
Abstract: Spoken language understanding (SLU) commonly employs cascading systems that integrate Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) modules. However, these systems often suffer from information loss, latency, and high costs, prompting significant research interest in end-to-end (E2E) SLU. Among E2E methods, joint models have emerged as particularly effective in terms of latency and accuracy. Prior works on joint E2E SLU, however, often represent the output logical form as a plain sequence of tokens, with no guarantee of producing a well-formed logical form. In this study, we enhance the joint E2E SLU approach by simplifying the output sequence and constraining the decoding process to candidate tokens. Specifically, we categorize tokens in the logical form into label tokens and normal tokens and apply a constrained candidate set to each token type. In experiments on the STOP dataset, our method outperforms previous works, achieving a 78% exact-match score (a 1.44-point improvement over the baseline).
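The core idea in the abstract — restricting each decoding step to a candidate set that depends on the token's type (label token vs. normal token) — can be illustrated with a minimal logit-masking sketch. The vocabulary, token ids, and the `constrained_argmax` helper below are all hypothetical and only illustrate the general mechanism, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical toy vocabulary mixing label tokens (intent/slot brackets)
# with normal (transcript) tokens, loosely following the STOP-style
# logical-form notation mentioned in the abstract.
VOCAB = ["[IN:PLAY_MUSIC", "[SL:SONG", "]", "play", "hello", "</s>"]
LABEL_IDS = {0, 1, 2}    # candidate set when a label token is expected
NORMAL_IDS = {3, 4, 5}   # candidate set when a normal token is expected

def constrained_argmax(logits, allowed_ids):
    """Mask out every token outside the allowed candidate set, then argmax."""
    masked = np.full_like(logits, -np.inf)
    for i in allowed_ids:
        masked[i] = logits[i]
    return int(np.argmax(masked))

# Example: the raw logits favor a normal token ("play"), but if the
# decoder state requires a label token next, the constraint overrides
# the unconstrained argmax and selects a label token instead.
logits = np.array([0.1, 0.3, 0.2, 2.0, 0.5, 0.0])
print(VOCAB[constrained_argmax(logits, LABEL_IDS)])   # → [SL:SONG
print(VOCAB[constrained_argmax(logits, NORMAL_IDS)])  # → play
```

In a real E2E decoder the same masking would be applied to the model's logits at every step, with the allowed set switched by a simple state machine that tracks whether the grammar expects a label or a normal token.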