Abstract: Model quantization is a popular approach to trading accuracy for speed during model serving, especially as models continue to grow in size. Traditionally, low-precision quantized models have offered better inference speed than their high-precision counterparts. However, the story may change with the advent of new machine learning instructions in modern processors. In this paper, we make this case using ARM processors. Our quantitative analysis shows that low-precision quantized models can be inferior in both accuracy and inference speed on modern processors. In response, we present MiL, a new quantized neural network package. Operators in MiL are optimized for inference using the machine learning instructions newly available on the target platform. Experiments show that serving neural networks using PyTorch with MiL outperforms state-of-the-art alternatives, including Riptide, TFLite with Ruy, and PyTorch with QNNPACK.
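Since the abstract centers on serving quantized models with PyTorch, the following is a minimal sketch of the fp32-versus-int8 comparison that motivates the paper. It uses stock PyTorch dynamic quantization rather than MiL's API (which the abstract does not describe); the model architecture, sizes, and benchmark loop are illustrative assumptions.

```python
# Minimal sketch (NOT MiL's API): compare fp32 vs. int8 inference latency
# with stock PyTorch dynamic quantization. Whether int8 actually wins
# depends on the CPU's instruction set (e.g., ARM dot-product/matrix
# instructions), which is the phenomenon the paper analyzes.
import time
import torch
import torch.nn as nn

# Illustrative toy model; real workloads would be full networks.
model_fp32 = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)
).eval()

# Quantize Linear weights to int8; activations are quantized dynamically.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(64, 512)

def bench(model, iters=100):
    """Average per-batch latency in seconds."""
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - start) / iters

print(f"fp32: {bench(model_fp32) * 1e3:.3f} ms/batch")
print(f"int8: {bench(model_int8) * 1e3:.3f} ms/batch")
```

On processors with fast low-precision matrix instructions the int8 path is typically faster, but the abstract's claim is precisely that this ordering can flip on modern hardware, depending on which instructions the quantized kernels exploit.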