Abstract: We present a novel constrained learning method for hybrid autoregressive transducer (HAT) models that yields better-justified language model (LM) adaptation. LM adaptation in HAT is justified only when the transducer logits and the sum of the speech and text logits from the label estimation sub-networks are approximately equal. We therefore add the mean squared error (MSE) between the two sets of logits to the HAT loss to encourage HAT models to satisfy this condition. The proposed method exhibited significantly lower and more stable internal language model perplexities than HAT. Consequently, it attained lower word error rates (WERs) than HAT across various model architecture settings, both with and without LM adaptation. On the television content task, the proposed method achieved a relative WER reduction of up to 28.60% compared to HAT. In most cases, the accuracy of pre-trained HAT models also improved after further training with the additional MSE loss.
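The core idea of the constrained loss can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, tensor shapes, and the weight on the MSE term are assumptions, and the base HAT/transducer loss is taken as given.

```python
import torch
import torch.nn.functional as F


def constrained_hat_loss(hat_loss, joint_logits, speech_logits, text_logits,
                         mse_weight=1.0):
    """Augment a precomputed HAT loss with an MSE penalty that pushes the
    transducer (joint) logits toward the sum of the speech and text logits,
    the condition under which LM adaptation in HAT is justified.

    All logit tensors are assumed to share the same shape, e.g.
    (batch, time, label, vocab); shapes here are illustrative.
    """
    mse = F.mse_loss(joint_logits, speech_logits + text_logits)
    return hat_loss + mse_weight * mse


# Illustrative usage with random logits in place of real model outputs.
torch.manual_seed(0)
joint = torch.randn(2, 4, 3, 10)
speech = torch.randn(2, 4, 3, 10)
text = torch.randn(2, 4, 3, 10)
base_loss = torch.tensor(1.0)  # stands in for the usual HAT loss
total = constrained_hat_loss(base_loss, joint, speech, text, mse_weight=0.5)
```

When the joint logits exactly equal the sum of the two sub-network logits, the penalty vanishes and the loss reduces to the plain HAT loss.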
External IDs: dblp:conf/interspeech/LeeKJPH23