4-bit Quantization of LSTM-based Speech Recognition Models

07 Jul 2021 (modified: 07 Jul 2021) · OpenReview Archive Direct Upload
Abstract: We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations. In particular, we customize quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time. We demonstrate our solution on the Switchboard (SWB) and CallHome (CH) test sets of the NIST Hub5-2000 evaluation. DBLSTM-HMMs trained with 300 or 2000 hours of SWB data achieve <0.5% and <1% average WER degradation, respectively. On the more challenging RNN-T models, our quantization strategy limits degradation in 4-bit inference to 1.3%.
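The abstract contrasts a naïve 4-bit quantizer with quantization schemes tailored to local properties of the network. As a point of reference, the naïve baseline can be sketched as symmetric per-tensor integer quantization; the function names and the max-magnitude scale choice below are our illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def quantize_4bit_symmetric(w, num_bits=4):
    """Naive symmetric per-tensor quantization to signed num_bits integers.

    The scale is derived from the largest weight magnitude, so a single
    outlier can waste most of the 16 available 4-bit levels -- one reason
    a naive scheme degrades WER on LSTM weights.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 7 for signed 4-bit
    scale = float(np.max(np.abs(w))) / qmax   # per-tensor scale (assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to float for error measurement."""
    return q.astype(np.float32) * scale

# Quantize a small LSTM-like weight matrix and inspect the rounding error.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(8, 8)).astype(np.float32)
q, s = quantize_4bit_symmetric(w)
max_err = float(np.max(np.abs(dequantize(q, s) - w)))
```

Per-channel or learned scales, as well as careful initialization of the quantizer parameters, are the kind of "local" refinements the abstract alludes to for closing the gap to full precision.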