Keywords: Spoken Language Understanding, Automatic Speech Recognition
TL;DR: In Spoken Language Understanding, the benefit of using N-best ASR alternatives is usually attributed to the error rate of the ASR. However, we show that the diversity of the alternatives is what matters.
Abstract: In Conversational AI, an Automatic Speech Recognition (ASR) system transcribes the user's speech, and the ASR output is passed as input to a Spoken Language Understanding (SLU) system, which outputs semantic objects (such as intents, slot-act pairs, etc.). Recent work, including state-of-the-art methods in SLU, utilizes either word lattices or N-best hypotheses from the ASR. The intuition given for using N-best instead of 1-best is that the additional hypotheses provide extra information that compensates for errors in the ASR transcriptions, i.e., the performance gain is attributed to the word error rate (WER) of the ASR. We empirically show that the gain from using N-best hypotheses is not related to WER but rather to the diversity of the hypotheses.
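To make the notion of "diversity of hypotheses" concrete, one simple way to quantify it is the mean pairwise normalized word-level edit distance over an N-best list. This is a hypothetical sketch for illustration, not necessarily the metric used in the paper; the function names `edit_distance` and `nbest_diversity` are our own.

```python
from itertools import combinations

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token sequences."""
    n = len(b)
    dp = list(range(n + 1))  # distances from the empty prefix of `a`
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution / match
            prev = cur
    return dp[n]

def nbest_diversity(hypotheses):
    """Mean pairwise normalized edit distance over an N-best list.

    0.0 means all hypotheses are identical; values approaching 1.0
    indicate a highly diverse list.
    """
    tokenized = [h.split() for h in hypotheses]
    pairs = list(combinations(tokenized, 2))
    if not pairs:
        return 0.0
    return sum(edit_distance(a, b) / max(len(a), len(b), 1)
               for a, b in pairs) / len(pairs)
```

Under this kind of measure, an N-best list whose entries are near-duplicates scores close to 0 regardless of how wrong they are (high WER, low diversity), while a list of genuinely different readings of the utterance scores higher, which is the distinction the abstract draws.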