Domain Robust, Fast, and Compact Neural Language Models

Published: 01 Jan 2020 · Last Modified: 19 Feb 2025 · ICASSP 2020 · CC BY-SA 4.0
Abstract: Despite advances in neural language modeling, obtaining a good model on a large-scale multi-domain dataset remains a difficult task. We propose training methods for building neural language models for such a task which are not only domain robust, but also reasonable in model size and fast to evaluate. We combine knowledge distillation from pretrained domain expert language models with the noise contrastive estimation (NCE) loss. Knowledge distillation allows us to train a single student model which is both compact and domain robust, while the NCE loss makes the model self-normalized, which enables fast evaluation. We conduct experiments on a large English multi-domain speech recognition dataset provided by AppTek. The resulting student model is the size of one domain expert, while it gives perplexities similar to those of the various teacher models on their expert domains; the model is self-normalized, allowing for 30% faster first-pass decoding than naive models which require the full softmax computation; and finally, it gives improvements of more than 8% relative in word error rate over a large multi-domain 4-gram count model trained on more than 10B words.
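
The abstract's core recipe is a combined training objective: a knowledge distillation term that pulls the student towards pretrained domain-expert teachers, plus an NCE term that lets the student's unnormalized scores be used directly at decoding time. The sketch below shows, in PyTorch, one plausible form of such a combined loss; it is an illustration only, with all function names, tensor shapes, the noise distribution, and the interpolation weight alpha chosen as assumptions rather than taken from the paper.

```python
# Illustrative PyTorch sketch (not the authors' implementation): one way a
# knowledge distillation (KD) term over teacher soft targets can be
# interpolated with a noise contrastive estimation (NCE) term. All names,
# shapes, and hyperparameters (num_noise, alpha) are assumptions.
import math
import torch
import torch.nn.functional as F

def nce_loss(scores, target, noise_logprobs, num_noise=100):
    """NCE binary classification loss over true words vs. sampled noise words.

    scores:         (batch, vocab) unnormalized student scores s(w|h), treated
                    directly as log-probabilities (self-normalization)
    target:         (batch,) indices of the true next words
    noise_logprobs: (vocab,) log-probabilities of the unigram noise distribution
    """
    batch = target.size(0)
    # Sample num_noise noise words per position from the noise distribution.
    noise = torch.multinomial(noise_logprobs.exp(), batch * num_noise,
                              replacement=True).view(batch, num_noise)

    # Logit of P(word came from data | word, history)
    #   = s(w|h) - log(num_noise) - log p_noise(w)
    def data_posterior_logit(word_scores, words):
        return word_scores - math.log(num_noise) - noise_logprobs[words]

    true_logit = data_posterior_logit(
        scores.gather(1, target.unsqueeze(1)).squeeze(1), target)
    noise_logit = data_posterior_logit(scores.gather(1, noise), noise)

    loss_true = F.binary_cross_entropy_with_logits(
        true_logit, torch.ones_like(true_logit))
    loss_noise = F.binary_cross_entropy_with_logits(
        noise_logit, torch.zeros_like(noise_logit))
    return loss_true + num_noise * loss_noise

def distill_nce_loss(student_scores, teacher_probs, target,
                     noise_logprobs, alpha=0.5):
    """Interpolate cross-entropy against the teacher's soft distribution (KD)
    with the NCE loss on the observed data."""
    kd = -(teacher_probs * F.log_softmax(student_scores, dim=-1)).sum(-1).mean()
    nce = nce_loss(student_scores, target, noise_logprobs)
    return alpha * kd + (1.0 - alpha) * nce
```

Note that the KD term in this sketch still computes a full softmax over the student scores; the paper's actual way of combining distillation with NCE-based self-normalized training may differ precisely in this respect.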