Quantization-aware Optimization Approach for CNNs Inference on CPUs

Published: 01 Jan 2024, Last Modified: 09 Apr 2025 · ASPDAC 2024 · CC BY-SA 4.0
Abstract: Data movement through the memory hierarchy is a fundamental bottleneck in most convolutional neural network (CNN) deployments on CPUs. Loop-level optimization and hybrid-bitwidth quantization are two representative approaches to reducing memory accesses. However, they have typically been applied independently, because combining them greatly increases the complexity of design space exploration. We present QAOpt, a quantization-aware optimization approach that reduces this complexity when combining the two for CNN deployment on CPUs. We develop a bitwidth-sensitive quantization strategy that navigates the trade-off between model accuracy and data movement when loop-level optimization and mixed-precision quantization are applied together. We also provide a quantization-aware pruning process that shrinks the design space for efficient exploration. Evaluation results demonstrate that our approach achieves better energy efficiency with acceptable accuracy loss.
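The abstract does not give implementation details, so the following is only a minimal sketch of the general idea behind bitwidth-sensitive, mixed-precision quantization: assign each layer the lowest candidate bitwidth whose quantization error stays under a per-layer budget, then estimate the resulting reduction in weight traffic. All function names, the error-budget heuristic, and the candidate bitwidths are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def quantize(w, bits):
    # Symmetric uniform quantization of a weight tensor to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    wmax = np.max(np.abs(w))
    scale = wmax / qmax if wmax > 0 else 1.0
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def assign_bitwidths(layers, candidate_bits=(4, 8, 16), err_budget=1e-2):
    # Hypothetical sensitivity proxy: pick the lowest bitwidth per layer
    # whose relative quantization error stays within `err_budget`.
    plan = {}
    for name, w in layers.items():
        for bits in candidate_bits:            # try lowest bitwidth first
            err = np.linalg.norm(w - quantize(w, bits)) / (np.linalg.norm(w) + 1e-12)
            if err <= err_budget:
                plan[name] = bits
                break
        else:
            plan[name] = max(candidate_bits)   # fall back to the widest bitwidth
    return plan

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for per-layer convolution weights.
    layers = {f"conv{i}": rng.normal(scale=0.1, size=(64, 64, 3, 3)) for i in range(4)}
    plan = assign_bitwidths(layers)
    fp32_bytes = sum(w.size * 4 for w in layers.values())
    mixed_bytes = sum(layers[n].size * b // 8 for n, b in plan.items())
    print(plan, f"weight traffic: {mixed_bytes / fp32_bytes:.2%} of FP32")
```

In the paper, such a per-layer bitwidth choice would additionally have to be co-explored with loop-level transformations (tiling, ordering), which is the design-space explosion the proposed quantization-aware pruning is meant to contain; that joint search is not modeled in this sketch.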