Abstract: Model quantization is a popular approach to trading accuracy for speed during model serving, especially as models continue to grow in size. Traditionally, low-precision quantized models have offered better inference speed than their high-precision counterparts. However, the story may change with the advent of new machine learning instructions in modern processors. In this paper, we make this case using ARM processors. Our quantitative analysis shows that low-precision quantized models can be inferior in both accuracy and inference speed on modern processors. In response, we present MiL, a new quantized neural network package. Operators in MiL are optimized for inference using the machine learning instructions newly available on the target platform. Experiments show that serving neural networks using PyTorch with MiL outperforms state-of-the-art alternatives, including Riptide, TFLite with Ruy, and PyTorch with QNNPACK.
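Since the abstract centers on serving quantized models with PyTorch, the following is a minimal sketch of the fp32-versus-int8 comparison that motivates the paper. It uses stock PyTorch dynamic quantization rather than MiL's API (which the abstract does not describe); the model architecture, sizes, and benchmark loop are illustrative assumptions.

```python
# Minimal sketch (NOT MiL's API): compare fp32 vs. int8 inference latency
# with stock PyTorch dynamic quantization. Whether int8 actually wins
# depends on the CPU's instruction set (e.g., ARM dot-product/matrix
# instructions), which is the phenomenon the paper analyzes.
import time
import torch
import torch.nn as nn

# Illustrative toy model; real workloads would be full networks.
model_fp32 = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)
).eval()

# Quantize Linear weights to int8; activations are quantized dynamically.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(64, 512)

def bench(model, iters=100):
    """Average per-batch latency in seconds."""
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - start) / iters

print(f"fp32: {bench(model_fp32) * 1e3:.3f} ms/batch")
print(f"int8: {bench(model_int8) * 1e3:.3f} ms/batch")
```

On processors with fast low-precision matrix instructions the int8 path is typically faster, but the abstract's claim is precisely that this ordering can flip on modern hardware, depending on which instructions the quantized kernels exploit.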