Character Error Rate Estimation for Automatic Speech Recognition of Short Utterances

Published: 01 Jan 2024 · Last Modified: 14 Mar 2025 · EUSIPCO 2024 · CC BY-SA 4.0
Abstract: The quality of an automatic speech recognition (ASR) system's output can be measured by comparing it with a gold-standard reference. Computing such an error rate (ER) requires a reference transcript, which is costly to obtain and therefore not always available. Instead, one can aim to estimate output quality without an explicit reference. Prior work has concentrated on confidence scoring or word error rate (WER) estimation. The latter is typically model-based, and the performance of a WER estimation model has been found to degrade when it is trained on short utterances. To address this issue, this work presents an ER estimation model based on the character error rate (CER), called Fe-CER. The model employs character-level tokenisation, which offers higher resolution on relatively short utterances. Fe-CER is compared with other ER estimation models operating on phonemes, byte-pair encoding tokens, and words. The performance of the models is measured using the normalised root mean square error (nRMSE), which takes into consideration the different distributions of target ERs. Fe-CER trained on CHiME-5 is shown to outperform the baseline WER-based model by 6.00% and 8.79% relative in nRMSE and Pearson correlation coefficient (PCC), respectively.
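For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal sketch of a character error rate (Levenshtein distance over characters, divided by the reference length) and an nRMSE between predicted and target error rates. The range-based normalisation shown here is one common convention and is an assumption; the paper's exact normalisation may differ.

```python
import math


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, via single-row DP."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance of ref[:i] vs hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev = cur
    return dp[n]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def nrmse(preds: list[float], targets: list[float]) -> float:
    """RMSE normalised by the range of the target values (assumed convention)."""
    rmse = math.sqrt(
        sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)
    )
    span = max(targets) - min(targets)
    return rmse / span if span else rmse
```

For example, `cer("hello", "helo")` gives 0.2 (one deleted character out of five), illustrating why character-level granularity gives finer resolution than word-level on short utterances, where a single word error already yields a coarse WER.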